From vonwyl at EIG.UNIGE.CH Mon Nov 1 02:39:32 2004 From: vonwyl at EIG.UNIGE.CH (von Wyl) Date: Mon, 01 Nov 2004 11:39:32 +0100 Subject: [openib-general] Failed limiting maximum outstanding PCI reads Message-ID: <41861264.2020305@eig.unige.ch> Hi, I get this problem after installing the gen2 roland-merge stack (for linux kernel) and the gen1 trunk (for useraccess) : vapi: Inspecting PCI chipset: [ECHEC ] Failed limiting maximum outstanding PCI reads and when I try lspci -s 3 (port number 3...) I get : 04:03.0 PCI bridge: Mellanox Technology: Unknown device 5a46 (rev a1) If anyone has an idea, please e-mail me... From kcm at psc.edu Mon Nov 1 04:44:07 2004 From: kcm at psc.edu (Ken MacInnis) Date: Mon, 01 Nov 2004 07:44:07 -0500 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> Message-ID: <41862F97.3080301@psc.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Right. I've had this machine (and OS running a much more vanilla configuration) and HBA using the OpenIB and MTI stacks just fine in the past. Dual Opteron, 8GB RAM, PCI-X MT23108. This same problem happens with this kernel on fairly different hardware we're using too, though.. It is Fedora Core 1, vanilla 2.4.24-based with Lustre 1.2.6 patches/mods. Almost nothing is modular in the kernel.. it is either off or compiled in. In fact, ACPI is turned off.. perhaps enabling it would be beneficial? I have attached the config file if that helps. Perhaps there is something critical I have unknowingly disabled. Also, another question I have is fairly naive -- at what point are the Lion Cub (PCI Express) cards supported in the OpenIB stack? I seem to remember the Tavor code supporting them inherently but in a non-efficient manner if native code wasn't used. Ken Tziporet Koren wrote: | The problem is that the driver does not get the interrupt for the command | completion, | and thus you get the error: "Command not completed after timeout". | | It is related to the OS & system you are using. What is the distribution you | are using? We once saw such problems with older versions of SuSE. | | Try to add append="acpi=off" to the lilo you are using or add also | disableapic in the same append line. | | | Tziporet | | | -----Original Message----- | From: Ken MacInnis [mailto:kcm at psc.edu] | Sent: Sunday, October 31, 2004 8:20 PM | To: openib-general at openib.org | Subject: [openib-general] Problem with 2.4.24 and gen1 | I've got a fairly modified kernel here I'm trying to get a OpenIB stack | running on. It's a vanilla 2.4.24 kernel with Lustre and other patches | in it, but I'm seeing this when I modprobe ib_tavor: | | Oct 31 13:13:05 samwise kernel: THH(1): cmdif.c[1190]: Command not | completed after timeout: cmd=TAV | OR_IF_CMD_MAD_IFC (0x24), token=0x1400, pid=0x8E1, go=0 | Oct 31 13:13:05 samwise kernel: THH(1): CMD ERROR DUMP. opcode=0x24, | opc_mod = 0x1, exec_time_micro | =300000000 | . | . 
| Oct 31 13:13:06 samwise kernel: THH(1): cmdif.c[842]: Failed command | 0x24 (TAVOR_IF_CMD_MAD_IFC): s | tatus=0x103 (0x0103 - unexpected error - fatal) | Oct 31 13:13:06 samwise kernel: | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[2790]: | THH_hob_query_port_prop: cmdif returned FA | TAL | Oct 31 13:13:06 samwise kernel: VIPKL(1): qpm.c[278]: QPM_new: | HOBKL_query_port_prop returned with | error: -254 = VAPI_EFATAL | Oct 31 13:13:06 samwise kernel: VIPKL(1): qpm.c[302]: QPM_new: | returned with error: -254 = VAPI_EF | ATAL | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[3474]: | THH_hob_fatal_err_thread: RECEIVED FATAL E | RROR WAKEUP | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[4490]: | THH_hob_halt_hca: HALT HCA returned 0x103 | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[1620]: | THH_hob_destroy: FATAL ERROR | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[1627]: | THH_hob_destroy: PERFORMING SW RESET. pa=0 | xFE9F0010 va=0xF8A01010 | Oct 31 13:13:06 samwise kernel: | Oct 31 13:13:06 samwise kernel: Mellanox Tavor Device Driver is creating | device "InfiniHost0" (bus=0 | 4, devfn=00) | Oct 31 13:13:06 samwise kernel: | Oct 31 13:13:06 samwise kernel: | [KERNEL_IB][_tsIbTavorInitOne][tavor_main.c:86]InfiniHost0: VAPI_ope | n_hca failed, status -254 (Fatal error (Local Catastrophic Error)) | Oct 31 13:13:06 samwise kernel: | [SRPTP][srp_host_init][srp_host.c:1495]SRP Host using indirect addre | ssing | | | This occurs with an older openib rev (200-ish) as well as one up-to-date | as of today. | | Everything else (modules.conf, etc.) is set up as it has been when I was | messing with 2.4 kernels and OpenIB a few months ago, so I'm not | thinking it's related to such. | | Any ideas? Yes, I know it's 2.4 as well as a fairly older 2.4, but I | have no choice here. :) lspci -vvv bits follow. | | 03:01.0 PCI bridge: Mellanox Technology: Unknown device 5a46 (rev a1) | (prog-if 00 [Normal decode]) | Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- | ParErr- Stepping- SERR+ FastB2B- | Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- | SERR-

Reset- FastB2B- | Capabilities: [70] PCI-X non-bridge device. | Command: DPERE+ ERO+ RBC=0 OST=4 | Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, | DC=simple, DMMRBC=0, DMOST=0, D | MCRS=0, RSCEM- | 04:00.0 InfiniBand: Mellanox Technology: Unknown device 5a44 (rev a1) | Subsystem: Mellanox Technology: Unknown device 5a44 | Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- | ParErr- Stepping- SERR+ FastB2B- | Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- | SERR-
From halr at voltaire.com Mon Nov 1 07:23:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 10:23:33 -0500 Subject: [openib-general] [PATCH]spinlock shouldn't be held while calling ib_post_send() In-Reply-To: <20041029170917.3faa58e3.mshefty@ichips.intel.com>
References: <20041029170917.3faa58e3.mshefty@ichips.intel.com> Message-ID: <1099322613.12249.25.camel@hpc-1> On Fri, 2004-10-29 at 20:09, Sean Hefty wrote: > On Fri, 29 Oct 2004 18:06:47 -0600 > Shirley Ma wrote: > > > Here is the patch. > > Note that my patch removes the lock when calling ib_post_send. But, > holding the lock when calling ib_post_send() should be fine. Also, the > current completion code assumes that the work requests are queued in the > same order that the sends are posted in. Releasing the lock after > queuing the request, but before calling ib_post_send() allows work > requests to be posted out of order from the order that they are queued > on the send posted list. So should this patch be applied or is it superseded by your pending patch (and I should wait for that) ? Thanks. -- Hal From roland at topspin.com Mon Nov 1 07:27:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 07:27:27 -0800 Subject: [openib-general] [PATCH]spinlock shouldn't be held while calling ib_post_send() In-Reply-To: <1099322613.12249.25.camel@hpc-1> (Hal Rosenstock's message of "Mon, 01 Nov 2004 10:23:33 -0500") References: <20041029170917.3faa58e3.mshefty@ichips.intel.com> <1099322613.12249.25.camel@hpc-1> Message-ID: <52pt2xvbc0.fsf@topspin.com> Hal> So should this patch be applied or is it superseded by your Hal> pending patch (and I should wait for that) ? Sounds like the patch is not needed and actively breaks things, so my guess would be that it's better not to apply. - R. From roland at topspin.com Mon Nov 1 07:29:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 07:29:38 -0800 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <41862F97.3080301@psc.edu> (Ken MacInnis's message of "Mon, 01 Nov 2004 07:44:07 -0500") References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> <41862F97.3080301@psc.edu> Message-ID: <52lldlvb8d.fsf@topspin.com> Ken> Also, another question I have is fairly naive -- at what Ken> point are the Lion Cub (PCI Express) cards supported in the Ken> OpenIB stack? I seem to remember the Tavor code supporting Ken> them inherently but in a non-efficient manner if native code Ken> wasn't used. Lion Cub aka Arbel aka PCI Ex HCA is supported in Tavor compatibility mode right now by the mthca driver. I just received firmware and documentation for native mode last week, so support for that will be "coming soon." However Tavor mode is not really much of a performance hit -- on a suitable motherboard you should still be able to hit (bus limited) 20 Gb/sec of throughput. - Roland From roland at topspin.com Mon Nov 1 07:30:01 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 07:30:01 -0800 Subject: [openib-general] Failed limiting maximum outstanding PCI reads In-Reply-To: <41861264.2020305@eig.unige.ch> (von Wyl's message of "Mon, 01 Nov 2004 11:39:32 +0100") References: <41861264.2020305@eig.unige.ch> Message-ID: <52hdo9vb7q.fsf@topspin.com> von> Hi, I get this problem after installing the gen2 roland-merge von> stack (for linux kernel) and the gen1 trunk (for useraccess) gen1 userspace won't work with gen2 kernel side, unfortunately.
- Roland From kcm at psc.edu Mon Nov 1 07:48:57 2004 From: kcm at psc.edu (Ken MacInnis) Date: Mon, 01 Nov 2004 10:48:57 -0500 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <52lldlvb8d.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> <41862F97.3080301@psc.edu> <52lldlvb8d.fsf@topspin.com> Message-ID: <41865AE9.20808@psc.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Roland Dreier wrote: | Ken> Also, another question I have is fairly naive -- at what | Ken> point are the Lion Cub (PCI Express) cards supported in the | Ken> OpenIB stack? I seem to remember the Tavor code supporting | Ken> them inherently but in a non-efficient manner if native code | Ken> wasn't used. | | Lion Cub aka Arbel aka PCI Ex HCA is supported in Tavor compatibility | mode right now by the mthca driver. I just received firmware and | documentation for native mode last week, so support for that will be | "coming soon." However Tavor mode is not really much of a performance | hit -- on a suitable motherboard you should still be able to hit (bus | limited) 20 Gb/sec of throughput. Does this extend to the older non-mthca code? I assume this Tavor compatibility mode is wrt. the HBA, so that it does. :) Thanks! Ken - -- Ken MacInnis - Systems Engineer, PSC - http://www.psc.edu/~kcm/ kcm at psc dot edu - +1 412 268 9833 (w) - +1 412 268 5832 (f) Pittsburgh Supercomputing Center - 4400 Fifth Ave - Pittsburgh, PA 15213 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (MingW32) iD8DBQFBhlrpnT0C17PQhv4RAsUJAJ0drjCY0G6UDeztXJDPIHJqA8NUuQCfarLj xeIisjQe2XGV9GQ755KaU+c= =pe9I -----END PGP SIGNATURE----- From tziporet at mellanox.co.il Mon Nov 1 08:00:38 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 1 Nov 2004 18:00:38 +0200 Subject: [openib-general] Problem with 2.4.24 and gen1 Message-ID: <506C3D7B14CDD411A52C00025558DED6064BE96E@mtlex01.yok.mtl.com> Tavor mode and native Arbel mode have the same performance. The main change for the native mode is the ability to work without attached DDR and use the system memory instead. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Monday, November 01, 2004 5:30 PM To: Ken MacInnis Cc: Tziporet Koren; Tech_Support; openib-general at openib.org Subject: Re: [openib-general] Problem with 2.4.24 and gen1 Ken> Also, another question I have is fairly naive -- at what Ken> point are the Lion Cub (PCI Express) cards supported in the Ken> OpenIB stack? I seem to remember the Tavor code supporting Ken> them inherently but in a non-efficient manner if native code Ken> wasn't used. Lion Cub aka Arbel aka PCI Ex HCA is supported in Tavor compatibility mode right now by the mthca driver. I just received firmware and documentation for native mode last week, so support for that will be "coming soon." However Tavor mode is not really much of a performance hit -- on a suitable motherboard you should still be able to hit (bus limited) 20 Gb/sec of throughput. - Roland -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Mon Nov 1 08:17:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 11:17:38 -0500 Subject: [openib-general] [PATCH]code optimization in ib_register_mad_agent() In-Reply-To: <20041029171345.1d01e8a3.mshefty@ichips.intel.com> References: <20041029171345.1d01e8a3.mshefty@ichips.intel.com> Message-ID: <1099325858.3074.1.camel@hpc-1> On Fri, 2004-10-29 at 20:13, Sean Hefty wrote: > On Fri, 29 Oct 2004 17:35:40 -0600 > Shirley Ma wrote: > > > I am starting to look at the access layer code. Here is a code > > optimization patch in ib_register_mad_agent(). > > ib_mad_client_id must be incremented while holding the spinlock (or > converted into an atomic). The rest of the initialization looks fine > moved upwards. Thanks. Applied with moving the ib_mad_client_id increment down under holding the registration lock. -- Hal From halr at voltaire.com Mon Nov 1 08:24:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 11:24:35 -0500 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> Message-ID: <1099326275.3074.3.camel@hpc-1> On Fri, 2004-10-29 at 16:14, Sean Hefty wrote: > On Fri, 29 Oct 2004 13:01:03 -0700 (PDT) > Krishna Kumar wrote: > > > Hi, > > > > I know this changes the verbs interface a bit, but ... > > > > I don't see a value in the qp_cap being passed to different routines, > > when either ib_qp_attr or ib_qp_init_attr, both of which contain a > > qp_cap, are being passed at the same time. > > The parameter is there to separate input/output parameters, and resulted > from the original VAPI evolution of the code. There's no strong > technical reason that it cannot be removed. Should this patch be applied ? If so, I will test this and then it can also be merged to roland's branch. -- Hal From kcm at psc.edu Mon Nov 1 09:40:15 2004 From: kcm at psc.edu (Ken MacInnis) Date: Mon, 01 Nov 2004 12:40:15 -0500 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> Message-ID: <418674FF.7050209@psc.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 ACPI was already not in the kernel. Appending 'noapic disableapic' did work to load the Tavor code. :) Thanks for the hint! However, now OpenSM is still misbehaving: - ------------------------------------------------- OpenSM Rev:B1-rc1 Command Line Arguments: ~ Log File: /tmp/osm.log - ------------------------------------------------- Error from osm_opensm_init (1) Error from osm_opensm_bind (0x2A) [1099330621:000868906][4000] -> OpenSM Rev:B1-rc1 [1099330621:000868958][4000] -> osm_opensm_init: Forcing single threaded dispatcher. [1099330621:000869383][4000] -> osm_report_notice: Received Generic Notice type:3 num:66 from LID:0x 0000 GUID:0xfe80000000000000,0x0000000000000000 [1099330621:000869402][4000] -> osm_report_notice: Received Generic Notice type:3 num:66 from LID:0x 0000 GUID:0xfe80000000000000,0x0000000000000000 [1099330621:000869445][4000] -> __osm_vendor_get_ca_ids: ERR 3D09: No available channel adapters. [1099330621:000869456][4000] -> osm_vendor_get_all_port_attr: ERR 3D13: Fail to get CA Ids . [1099330621:000869484][4000] -> __osm_vendor_get_ca_ids: ERR 3D11: : Bad parameter in calling: EVAPI _list_hcas. 
[1099330621:000869493][4000] -> osm_vendor_get_guid_ca_and_port: ERR 3D16: Fail to get CA Ids . [1099330621:000869503][4000] -> osm_vendor_bind: ERR 5005: Fail to find port number of port guid:0x0 000000000000000 [1099330621:000869515][4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. [1099330621:000869526][4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). Any ideas on this? I did make very sure to check that userland and opensm was in sync with the kernel bits I'm using. The 0s in the LID and GUID are concerning me. I may end up trying the newer OpenIB stack for fun (ha), and see if that works better. Ken Tziporet Koren wrote: | Hi, | | The problem is that the driver does not get the interrupt for the command | completion, | and thus you get the error: "Command not completed after timeout". | | It is related to the OS & system you are using. What is the distribution you | are using? We once saw such problems with older versions of SuSE. | | Try to add append="acpi=off" to the lilo you are using or add also | disableapic in the same append line. | -----Original Message----- | From: Ken MacInnis [mailto:kcm at psc.edu] | Sent: Sunday, October 31, 2004 8:20 PM | To: openib-general at openib.org | Subject: [openib-general] Problem with 2.4.24 and gen1 | I've got a fairly modified kernel here I'm trying to get a OpenIB stack | running on. It's a vanilla 2.4.24 kernel with Lustre and other patches | in it, but I'm seeing this when I modprobe ib_tavor: | | Oct 31 13:13:05 samwise kernel: THH(1): cmdif.c[1190]: Command not | completed after timeout: cmd=TAV - -- Ken MacInnis - Systems Engineer, PSC - http://www.psc.edu/~kcm/ kcm at psc dot edu - +1 412 268 9833 (w) - +1 412 268 5832 (f) Pittsburgh Supercomputing Center - 4400 Fifth Ave - Pittsburgh, PA 15213 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (MingW32) iD8DBQFBhnT/nT0C17PQhv4RAicqAJ9hRiudNE1Bfof+BDrG09XfA5jD/wCcDH/D UT/E1V7i0yO6pPPOx9oobNQ= =R5wl -----END PGP SIGNATURE----- From mshefty at ichips.intel.com Mon Nov 1 09:40:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 09:40:03 -0800 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <1099326275.3074.3.camel@hpc-1> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> Message-ID: <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 11:24:35 -0500 Hal Rosenstock wrote: > On Fri, 2004-10-29 at 16:14, Sean Hefty wrote: > > On Fri, 29 Oct 2004 13:01:03 -0700 (PDT) > > Krishna Kumar wrote: > > > > > Hi, > > > > > > I know this changes the verbs interface a bit, but ... > > > > > > I don't see a value in the qp_cap being passed to different > > > routines, when either ib_qp_attr or ib_qp_init_attr, both of which > > > contain a qp_cap, are being passed at the same time. > > > > The parameter is there to separate input/output parameters, and > > resulted from the original VAPI evolution of the code. There's no > > strong technical reason that it cannot be removed. > > Should this patch be applied ? If so, I will test this and then it can > also be merged to roland's branch. I'm fine with applying this patch - just wanted to let others provide input. We should probably modify ipoib before committing the changes. 
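To make the resulting calling convention concrete, here is a minimal usage sketch (the cap values are illustrative and other init_attr fields are omitted; the point is that the capabilities now travel in and out through init_attr.cap rather than through a separate qp_cap argument, kernel context assumed):

	struct ib_qp_init_attr init_attr = {
		.cap = {
			.max_send_wr  = 64,	/* requested values */
			.max_recv_wr  = 64,
			.max_send_sge = 1,
			.max_recv_sge = 1,
		},
		.rq_sig_type = IB_SIGNAL_ALL_WR,
		.qp_type     = IB_QPT_UD
	};
	struct ib_qp *qp;

	qp = ib_create_qp(pd, &init_attr);
	if (IS_ERR(qp))
		return PTR_ERR(qp);
	/* on success the provider reports the actual capabilities back
	 * through init_attr.cap; in the mthca patch in this thread only
	 * max_inline_data is overwritten (to 0) */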
- Sean From mshefty at ichips.intel.com Mon Nov 1 09:41:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 09:41:46 -0800 Subject: [openib-general] [PATCH]spinlock shouldn't be held while calling ib_post_send() In-Reply-To: <52pt2xvbc0.fsf@topspin.com> References: <20041029170917.3faa58e3.mshefty@ichips.intel.com> <1099322613.12249.25.camel@hpc-1> <52pt2xvbc0.fsf@topspin.com> Message-ID: <20041101094146.59996de5.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 07:27:27 -0800 Roland Dreier wrote: > Hal> So should this patch be applied or is it superceeded by your > Hal> pending patch (and I should wait for that) ? > > sounds like the patch is not needed and actively breaks things, so my > guess would be that it's better not to apply. Correct - I would not apply this patch. - Sean From halr at voltaire.com Mon Nov 1 10:40:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 13:40:42 -0500 Subject: [openib-general] SM and smi Message-ID: <1099334442.3074.45.camel@hpc-1> The Get/Set(SMInfo) is one aspect related to the MAD layer for SM support which has been discussed on the list (and the changes are still on my TODO list). I was wondering how others saw SMI support for the SM. It seems to me that it makes sense to expose the routines that the agent is using so that they do not need to be reinvented for the SM. Does that make sense ? Better yet might be exposing a routine for SM class sending. Thanks. -- Hal From halr at voltaire.com Mon Nov 1 12:42:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 15:42:16 -0500 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> Message-ID: <1099341735.3074.97.camel@hpc-1> On Mon, 2004-11-01 at 12:40, Sean Hefty wrote: > On Mon, 01 Nov 2004 11:24:35 -0500 > Hal Rosenstock wrote: > > > On Fri, 2004-10-29 at 16:14, Sean Hefty wrote: > > > On Fri, 29 Oct 2004 13:01:03 -0700 (PDT) > > > Krishna Kumar wrote: > > > > > > > Hi, > > > > > > > > I know this changes the verbs interface a bit, but ... > > > > > > > > I don't see a value in the qp_cap being passed to different > > > > routines, when either ib_qp_attr or ib_qp_init_attr, both of which > > > > contain a qp_cap, are being passed at the same time. > > > > > > The parameter is there to separate input/output parameters, and > > > resulted from the original VAPI evolution of the code. There's no > > > strong technical reason that it cannot be removed. > > > > Should this patch be applied ? If so, I will test this and then it can > > also be merged to roland's branch. > > I'm fine with applying this patch - just wanted to let others provide > input. We should probably modify ipoib before committing the changes. Thanks. Applied (excepting the change to mthca_provider.c). Attached is the remaining patch for roland's branch. Note mad.c will need to be moved over as well. 
-- Hal Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 1106) +++ include/ib_verbs.h (working copy) @@ -709,12 +709,10 @@ struct ib_ah_attr *ah_attr); int (*destroy_ah)(struct ib_ah *ah); struct ib_qp * (*create_qp)(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr, - struct ib_qp_cap *qp_cap); + struct ib_qp_init_attr *qp_init_attr); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_cap *qp_cap); + int qp_attr_mask); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, @@ -851,13 +849,11 @@ int ib_destroy_ah(struct ib_ah *ah); struct ib_qp *ib_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr, - struct ib_qp_cap *qp_cap); + struct ib_qp_init_attr *qp_init_attr); int ib_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_cap *qp_cap); + int qp_attr_mask); int ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, Index: core/verbs.c =================================================================== --- core/verbs.c (revision 1106) +++ core/verbs.c (working copy) @@ -105,12 +105,11 @@ /* Queue pairs */ struct ib_qp *ib_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr, - struct ib_qp_cap *qp_cap) + struct ib_qp_init_attr *qp_init_attr) { struct ib_qp *qp; - qp = pd->device->create_qp(pd, qp_init_attr, qp_cap); + qp = pd->device->create_qp(pd, qp_init_attr); if (!IS_ERR(qp)) { qp->device = pd->device; @@ -133,10 +132,9 @@ int ib_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_cap *qp_cap) + int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, qp_cap); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); } EXPORT_SYMBOL(ib_modify_qp); Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 1106) +++ hw/mthca/mthca_provider.c (working copy) @@ -287,8 +287,7 @@ } static struct ib_qp *mthca_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *init_attr, - struct ib_qp_cap *qp_cap) + struct ib_qp_init_attr *init_attr) { struct mthca_qp *qp; int err; @@ -347,8 +346,7 @@ return ERR_PTR(err); } - *qp_cap = init_attr->cap; - qp_cap->max_inline_data = 0; + init_attr->cap.max_inline_data = 0; return &qp->ibqp; } Index: ulp/ipoib/ipoib_verbs.c =================================================================== --- ulp/ipoib/ipoib_verbs.c (revision 1106) +++ ulp/ipoib/ipoib_verbs.c (working copy) @@ -27,7 +27,6 @@ { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr *qp_attr; - struct ib_qp_cap qp_cap; int attr_mask; int ret; u16 pkey_index; @@ -47,7 +46,7 @@ /* set correct QKey for QP */ qp_attr->qkey = priv->qkey; attr_mask = IB_QP_QKEY; - ret = ib_modify_qp(priv->qp, qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); goto out; @@ -98,7 +97,6 @@ .rq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; - struct ib_qp_cap qp_cap; struct ib_qp_attr qp_attr; int attr_mask; @@ -115,7 +113,7 @@ } set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - priv->qp = ib_create_qp(priv->pd, &init_attr, &qp_cap); + priv->qp = ib_create_qp(priv->pd, &init_attr); if (IS_ERR(priv->qp)) { ipoib_warn(priv, "failed to create QP\n"); return PTR_ERR(priv->qp); @@ -137,7 +135,7 @@ IB_QP_PORT | IB_QP_PKEY_INDEX | IB_QP_STATE; - ret 
= ib_modify_qp(priv->qp, &qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); goto out_fail; @@ -146,7 +144,7 @@ qp_attr.qp_state = IB_QPS_RTR; /* Can't set this in a INIT->RTR transition */ attr_mask &= ~IB_QP_PORT; - ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); goto out_fail; @@ -156,7 +154,7 @@ qp_attr.sq_psn = 0; attr_mask |= IB_QP_SQ_PSN; attr_mask &= ~IB_QP_PKEY_INDEX; - ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); goto out_fail; From roland at topspin.com Mon Nov 1 14:15:33 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 14:15:33 -0800 Subject: [openib-general] SM and smi In-Reply-To: <1099334442.3074.45.camel@hpc-1> (Hal Rosenstock's message of "Mon, 01 Nov 2004 13:40:42 -0500") References: <1099334442.3074.45.camel@hpc-1> Message-ID: <52lldltdve.fsf@topspin.com> Hal> I was wondering how others saw SMI support for the SM. It Hal> seems to me that it makes sense to expose the routines that Hal> the agent is using so that they do not need to be reinvented Hal> for the SM. Does that make sense ? Better yet might be Hal> exposing a routine for SM class sending. I think SMI processing should be applied to all DR SMPs passed to ib_post_send_mad(). This is what the Topspin stack does and I believe it is what OpenSM expects. - Roland From halr at voltaire.com Mon Nov 1 14:23:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 17:23:36 -0500 Subject: [openib-general] SM and smi In-Reply-To: <52lldltdve.fsf@topspin.com> References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> Message-ID: <1099347815.3270.5.camel@localhost.localdomain> On Mon, 2004-11-01 at 17:15, Roland Dreier wrote: > I think SMI processing should be applied to all DR SMPs passed to > ib_post_send_mad(). This is what the Topspin stack does and I believe > it is what OpenSM expects. That works for me. In doing that, the 0 hop outgoing case should call process_mad and return the response appropriately. Is the same thing true for DLID = local LID or does the driver or HCA handle this case as well ? -- Hal From roland at topspin.com Mon Nov 1 14:26:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 14:26:53 -0800 Subject: [openib-general] SM and smi In-Reply-To: <1099347815.3270.5.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 01 Nov 2004 17:23:36 -0500") References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> <1099347815.3270.5.camel@localhost.localdomain> Message-ID: <52hdo9tdci.fsf@topspin.com> Hal> That works for me. In doing that, the 0 hop outgoing case Hal> should call process_mad and return the response Hal> appropriately. This behavior should probably be set by a flag. It's required for Tavor/Arbel, but we found on Anafa2 that performance was much better if we just posted zero-hop DR SMPs to the send queue. Hal> Is the same thing true for DLID = local LID or does the Hal> driver or HCA handle this case as well ? No, it's not required for LID-routed loopbacks. - R. 
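The zero-hop behavior discussed above, as a minimal self-contained sketch (all structure and function names here are illustrative stand-ins, not the actual openib smi.c or driver API; it captures the proposed per-device flag -- consume an outgoing zero-hop directed-route SMP locally via process_mad where the HCA requires it, otherwise just post it and let the hardware loop it back):

	#include <stdbool.h>

	#define MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81

	struct smp {
		unsigned char mgmt_class;
		unsigned char hop_cnt;	/* 0 == addressed to the local node */
	};

	struct hca {
		bool process_local_dr;	/* required for Tavor/Arbel, off for Anafa2 */
		int (*process_mad)(struct smp *in, struct smp *reply);
		int (*post_send)(struct smp *out);
	};

	static int send_smp(struct hca *hca, struct smp *smp, struct smp *reply)
	{
		if (hca->process_local_dr &&
		    smp->mgmt_class == MGMT_CLASS_SUBN_DIRECTED_ROUTE &&
		    smp->hop_cnt == 0)
			/* zero-hop DR SMP: consume locally, synthesize the reply */
			return hca->process_mad(smp, reply);

		/* otherwise post normally and let the HCA loop it back */
		return hca->post_send(smp);
	}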
From mashirle at us.ibm.com Mon Nov 1 15:01:26 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Mon, 1 Nov 2004 15:01:26 -0800 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() Message-ID: <200411011501.26812.mashirle@us.ibm.com> I am using my unix account to send the patch. Hope it works. diff -urN access/mad.c access.patch2/mad.c --- access/mad.c 2004-11-01 14:51:41.356902216 -0800 +++ access.patch2/mad.c 2004-11-01 14:53:37.003321288 -0800 @@ -368,16 +368,15 @@ struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_port_private *port_priv; - cur_send_wr = send_wr; /* Validate supplied parameters */ if (!mad_agent || !send_wr) { - *bad_send_wr = cur_send_wr; + *bad_send_wr = send_wr; return -EINVAL; } if (!mad_agent->send_handler || (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) { - *bad_send_wr = cur_send_wr; + *bad_send_wr = send_wr; return -EINVAL; } -- Thanks Shirley Ma IBM Linux Technology Center From mshefty at ichips.intel.com Mon Nov 1 15:06:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 15:06:20 -0800 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <200411011501.26812.mashirle@us.ibm.com> References: <200411011501.26812.mashirle@us.ibm.com> Message-ID: <20041101150620.686aad29.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 15:01:26 -0800 Shirley Ma wrote: > I am using my unix account to send the patch. Hope it works. > > diff -urN access/mad.c access.patch2/mad.c > --- access/mad.c 2004-11-01 14:51:41.356902216 -0800 > +++ access.patch2/mad.c 2004-11-01 14:53:37.003321288 -0800 > @@ -368,16 +368,15 @@ > struct ib_mad_agent_private *mad_agent_priv; > struct ib_mad_port_private *port_priv; > > - cur_send_wr = send_wr; > /* Validate supplied parameters */ > if (!mad_agent || !send_wr) { > - *bad_send_wr = cur_send_wr; > + *bad_send_wr = send_wr; > return -EINVAL; > } > > if (!mad_agent->send_handler || > (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) { > - *bad_send_wr = cur_send_wr; > + *bad_send_wr = send_wr; > return -EINVAL; > } Patch looks good to me, and should be applied. It raises an issue with the current code, though. There are checks for a valid mad_agent, valid_wr, but not a valid *bad_send_wr. I'm wondering if we should convert these checks to BUG_ON, or add in a check for a *bad_send_wr. As a minor optimization, we could make bad_send_wr optional for cases where only a single work request is being posted. - Sean From halr at voltaire.com Mon Nov 1 15:32:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 18:32:21 -0500 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <200411011501.26812.mashirle@us.ibm.com> References: <200411011501.26812.mashirle@us.ibm.com> Message-ID: <1099351941.3270.17.camel@localhost.localdomain> Thanks. Applied. From mashirle at us.ibm.com Mon Nov 1 15:37:02 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Mon, 1 Nov 2004 15:37:02 -0800 Subject: [openib-general] [PATCH]return the wrong bad_send_wr in ib_post_send_mad() Message-ID: <200411011537.02928.mashirle@us.ibm.com> Here is the patch to return the correct bad_send_wr value after calling ib_send_mad() in ib_post_send_mad().
diff -urN access/mad.c access.patch3/mad.c --- access/mad.c 2004-11-01 14:51:41.000000000 -0800 +++ access.patch3/mad.c 2004-11-01 15:31:05.013571784 -0800 @@ -389,7 +389,6 @@ cur_send_wr = send_wr; while (cur_send_wr) { unsigned long flags; - struct ib_send_wr *bad_wr; struct ib_mad_send_wr_private *mad_send_wr; next_send_wr = (struct ib_send_wr *)cur_send_wr->next; @@ -423,7 +422,7 @@ cur_send_wr->next = NULL; ret = ib_send_mad(mad_agent_priv, mad_send_wr, - cur_send_wr, &bad_wr); + cur_send_wr, bad_send_wr); if (ret) { /* Handle QP overrun separately... -ENOMEM */ @@ -432,7 +431,6 @@ list_del(&mad_send_wr->agent_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - *bad_send_wr = cur_send_wr; atomic_dec(&mad_agent_priv->refcount); return ret; } -- Thanks Shirley Ma IBM Linux Technology Center From halr at voltaire.com Mon Nov 1 15:39:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 18:39:59 -0500 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <20041101150620.686aad29.mshefty@ichips.intel.com> References: <200411011501.26812.mashirle@us.ibm.com> <20041101150620.686aad29.mshefty@ichips.intel.com> Message-ID: <1099352398.3270.25.camel@localhost.localdomain> On Mon, 2004-11-01 at 18:06, Sean Hefty wrote: > It raises an issue with the current code, though. There are checks for > a valid mad_agent, valid_wr, but not a valid *bad_send_wr. I'm > wondering if we should convert these checks to BUG_ON, or add in a check > for a *bad_send_wr. I don't think this is an "or". A check for *bad_send_wr should be added (which might be changed based on the below question). I will post a patch for this. IMO these should be BUG_ON but just errors as these are localized coding errors in some client. > As a minor optimization, we could make bad_send_wr > optional for cases where only a single work request is being posted. If *bad_send_wr is to be validated, the only time when NULL is allowed is when there is only one send_wr. Wouldn't this nullify any savings (unless the validation is removed) ? 
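For concreteness, the shape being discussed might look something like this (a self-contained sketch with simplified, made-up types, not the actual mad.c code; it validates the bad_send_wr pointer itself and tolerates NULL only for a single-request post):

	#include <errno.h>
	#include <stddef.h>

	struct send_wr {
		struct send_wr *next;	/* NULL when posting a single request */
	};

	static int post_send_sketch(struct send_wr *send_wr,
				    struct send_wr **bad_send_wr)
	{
		/* bad_send_wr may be omitted only for a lone request, since
		 * there is then no ambiguity about which request failed */
		if (!send_wr || (send_wr->next && !bad_send_wr)) {
			if (bad_send_wr)
				*bad_send_wr = send_wr;
			return -EINVAL;
		}

		/* ... queue and post each request here ... */

		return 0;
	}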
-- Hal From mashirle at us.ibm.com Mon Nov 1 15:47:47 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Mon, 1 Nov 2004 15:47:47 -0800 Subject: [openib-general] [PATCH]return the wrong bad_send_wr in ib_send_mad() In-Reply-To: <200411011537.02928.mashirle@us.ibm.com> References: <200411011537.02928.mashirle@us.ibm.com> Message-ID: <200411011547.47539.mashirle@us.ibm.com> Another patch to fix wrong bad_send_wr in ib_send_mad() diff -urN access/mad.c access.patch4/mad.c --- access/mad.c 2004-11-01 14:51:41.000000000 -0800 +++ access.patch4/mad.c 2004-11-01 15:44:08.173513376 -0800 @@ -347,10 +347,8 @@ list_add_tail(&mad_send_wr->send_list, &port_priv->send_posted_mad_list); port_priv->send_posted_mad_count++; - } else { + } else printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - *bad_send_wr = send_wr; - } spin_unlock_irqrestore(&port_priv->send_list_lock, flags); return ret; } -- Thanks Shirley Ma IBM Linux Technology Center From halr at voltaire.com Mon Nov 1 15:59:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 18:59:08 -0500 Subject: [openib-general] [PATCH] mad: Validate *bad_send_wr in ib_post_send_mad() Message-ID: <1099353548.2628.1.camel@hpc-1> mad: Validate *bad_send_wr in ib_post_send_mad() Index: mad.c =================================================================== --- mad.c (revision 1109) +++ mad.c (working copy) @@ -369,7 +369,7 @@ struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ - if (!mad_agent || !send_wr) { + if (!mad_agent || !send_wr || !*bad_send_wr) { *bad_send_wr = send_wr; return -EINVAL; } From mshefty at ichips.intel.com Mon Nov 1 15:52:41 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 15:52:41 -0800 Subject: [openib-general] [PATCH] mad: Validate *bad_send_wr in ib_post_send_mad() In-Reply-To: <1099353548.2628.1.camel@hpc-1> References: <1099353548.2628.1.camel@hpc-1> Message-ID: <20041101155241.4d058ef4.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 18:59:08 -0500 Hal Rosenstock wrote: > mad: Validate *bad_send_wr in ib_post_send_mad() > > Index: mad.c > =================================================================== > --- mad.c (revision 1109) > +++ mad.c (working copy) > @@ -369,7 +369,7 @@ > struct ib_mad_port_private *port_priv; > > /* Validate supplied parameters */ > - if (!mad_agent || !send_wr) { > + if (!mad_agent || !send_wr || !*bad_send_wr) { > *bad_send_wr = send_wr; We can't set bad_send_wr if it's invalid. - Sean From mshefty at ichips.intel.com Mon Nov 1 15:58:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 15:58:05 -0800 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <1099352398.3270.25.camel@localhost.localdomain> References: <200411011501.26812.mashirle@us.ibm.com> <20041101150620.686aad29.mshefty@ichips.intel.com> <1099352398.3270.25.camel@localhost.localdomain> Message-ID: <20041101155805.515aa53b.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 18:39:59 -0500 Hal Rosenstock wrote: > I don't think this is an "or". A check for *bad_send_wr should be > added(which might be changed based on the below question). I will post > a patch for this. IMO these should be BUG_ON but just errors as these > are localized coding errors in some client. > > > As a minor optimization, we could make bad_send_wr > > optional for cases where only a single work request is being posted. 
> > If *bad_send_wr is to be validated, the only time when NULL is allowed > is when there is only one send_wr. Wouldn't this nullify any savings > (unless the validation is removed) ? Yes, I was thinking of the case where we removed the validation, but the savings is to make it easier on clients that always send a single MAD at a time. - Sean From krkumar at us.ibm.com Mon Nov 1 16:12:18 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 16:12:18 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad Message-ID: I believe the recent changes to catch all atomic_dec races with unregister failed to catch one spot in ib_post_send_mad. This routine increments mad_agent_priv->refcnt, and while unregister can run, if the ib_send_mad() fails, we drop the refcnt without checking if the refcnt has dropped to zero. The unregister would block indefinitely waiting to be woken up. I think the rest of the atomic_dec's looks good though. Patch included as attachment as well as inline ... Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-01 16:01:05.000000000 -0800 +++ 2/mad.c 2004-11-01 16:01:09.000000000 -0800 @@ -432,7 +432,8 @@ int ib_post_send_mad(struct ib_mad_agent spin_unlock_irqrestore(&mad_agent_priv->lock, flags); *bad_send_wr = cur_send_wr; - atomic_dec(&mad_agent_priv->refcount); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); return ret; } cur_send_wr= next_send_wr; -------------- next part -------------- diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-01 16:01:05.000000000 -0800 +++ 2/mad.c 2004-11-01 16:01:09.000000000 -0800 @@ -432,7 +432,8 @@ int ib_post_send_mad(struct ib_mad_agent spin_unlock_irqrestore(&mad_agent_priv->lock, flags); *bad_send_wr = cur_send_wr; - atomic_dec(&mad_agent_priv->refcount); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); return ret; } cur_send_wr= next_send_wr; From tduffy at sun.com Mon Nov 1 16:25:47 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 01 Nov 2004 16:25:47 -0800 Subject: [openib-general] [PATCH] Better IPoIB multicast handling In-Reply-To: <52wtxiea77.fsf@topspin.com> References: <528y9yhb5o.fsf@topspin.com> <1098477556.1127.9.camel@duffman> <52wtxiea77.fsf@topspin.com> Message-ID: <1099355147.9878.75.camel@duffman> On Fri, 2004-10-22 at 14:04 -0700, Roland Dreier wrote: > Can you try running with this debugging patch? (It should just crash sooner) So, I haven't been able to trigger the bug like I used to. I am not sure why, but after a series of fiasco's (Linux/sparc64 box rootfs corrupted, Solaris 10 server that I was running my SM on exposed a bug that forced me to upgrade to later build 70, the SM needed to be updated, and the firmware on my Tavor card was hosed in this process, leading me to have to reflash it in protected mode), it is all working fairly smoothly. I do get this warning now when I ifconfig up my device: ib0.8001: multicast group ff12401b8001000000000000ffffffff already attached But it seems harmless. I will keep trying to break it, but I seem to be doing a better job of making myself other work. -tduffy -- "A democracy cannot exist as a permanent form of government. It can only exist until the voters discover that they can vote themselves money from the public treasure. 
From that moment on, the majority always votes for the candidates promising the most money from the public treasury, with the result that democracy always collapses over loose fiscal policy followed by a dictatorship. The average of the world's greatest civilizations has been two hundred years. These nations have progressed through the following sequence: from bondage to spiritual faith, from spiritual faith to great courage, from courage to liberty, from liberty to abundance, from abundance to selfishness, from selfishness to complacency, from complacency to apathy, from apathy to dependency, from dependency back to bondage." -- Alexander Tyler, 1778 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Nov 1 16:25:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 16:25:44 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: Message-ID: <20041101162544.5af3f091.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 16:12:18 -0800 (PST) Krishna Kumar wrote: > I believe the recent changes to catch all atomic_dec races with > unregister failed to catch one spot in ib_post_send_mad. This routine > increments mad_agent_priv->refcnt, and while unregister can run, if > the ib_send_mad() fails, we drop the refcnt without checking if the > refcnt has dropped to zero. The unregister would block indefinitely > waiting to be woken up. I think the rest of the atomic_dec's looks > good though. I looked at this area of the code, and my thought was that we cannot handle a client that tries to send a MAD at the same time that they unregister. So, I think that a simple atomic_dec should be okay. If a client is calling unregister in a separate thread, then they are essentially trying to send a MAD after unregistering, in which case our data structures have been freed. - Sean > *bad_send_wr = cur_send_wr; > - atomic_dec(&mad_agent_priv->refcount); > + if (atomic_dec_and_test(&mad_agent_priv->refcount)) > + wake_up(&mad_agent_priv->wait); > return ret; > } From krkumar at us.ibm.com Mon Nov 1 16:38:03 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 16:38:03 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041101162544.5af3f091.mshefty@ichips.intel.com> Message-ID: Hi Sean, I think it is reasonable to have current senders racing with unregister. The unregister is waiting for all references to drop to zero before freeing up the resources. It killed the ones waiting for responses (mad_cancel), killed the ones who are executing in callback handlers, and finally after dropping the loader's module refcnt, it waits for the refcnt to drop to zero. These can only be threads which are actively receiving mad packets and those threads in the process of sending mad packets while the unregister was going on (and the ones which fail is the only cause of the problem). Essentially I think the unregister will hang and not free up the resource. Thanks, - KK On Mon, 1 Nov 2004, Sean Hefty wrote: > On Mon, 1 Nov 2004 16:12:18 -0800 (PST) > Krishna Kumar wrote: > > > I believe the recent changes to catch all atomic_dec races with > > unregister failed to catch one spot in ib_post_send_mad. 
This routine > > increments mad_agent_priv->refcnt, and while unregister can run, if > > the ib_send_mad() fails, we drop the refcnt without checking if the > > refcnt has dropped to zero. The unregister would block indefinitely > > waiting to be woken up. I think the rest of the atomic_dec's looks > > good though. > > I looked at this area of the code, and my thought was that we cannot > handle a client that tries to send a MAD at the same time that they > unregister. So, I think that a simple atomic_dec should be okay. If a > client is calling unregister in a separate thread, then they are > essentially trying to send a MAD after unregistering, in which case our > data structures have been freed. > > - Sean > > > > *bad_send_wr = cur_send_wr; > > - atomic_dec(&mad_agent_priv->refcount); > > + if (atomic_dec_and_test(&mad_agent_priv->refcount)) > > + wake_up(&mad_agent_priv->wait); > > return ret; > > } > > From mshefty at ichips.intel.com Mon Nov 1 16:59:04 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 16:59:04 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: <20041101162544.5af3f091.mshefty@ichips.intel.com> Message-ID: <20041101165904.7a91f394.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 16:38:03 -0800 (PST) Krishna Kumar wrote: > Hi Sean, > > I think it is reasonable to have current senders racing with > unregister. The unregister is waiting for all references to drop to > zero before freeing up the resources. It killed the ones waiting for > responses(mad_cancel), killed the ones who are executing in callback > handlers, and finally after dropping the loader's module refcnt, it > waits for the refcnt to drop to zero. These can only be threads which > are actively receiving mad packets and those threads in the process of > sending mad packets while the unregister was going on (and the ones > which fail is the only cause of the problem). Essentially I think the > unregister will hang and not free up the resource. The difference here is that a client is calling into the API at the same time that they are trying to unregister. The code, even with this change, cannot handle this condition. For example, if the thread calling ib_unregister_mad_agent executes completely before the thread calling ib_post_send_mad runs (or can take a reference on the mad_agent), the mad_agent is no longer valid, and the structure will have been freed. The thread executing ib_post_send_mad can crash the system at this point. If we want to allow a client to call ib_unregister_mad_agent and ib_post_send_mad simultaneously, then ib_post_send_mad would need to perform some sort of lookup (likely in some global map) to validate the mad_agent. - Sean From krkumar at us.ibm.com Mon Nov 1 17:40:56 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 17:40:56 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041101165904.7a91f394.mshefty@ichips.intel.com> Message-ID: Hi Sean, I agree on the race between the threads, and this is something that I had considered as a separate problem (but now it comes back to haunt me :-). An easier solution for this problem is to make sure that whoever gets the agent (ib_mad_recv_done_handler) validate the mad_agent before calling us. Basically find_mad_agent can hold a refcnt on the agent. Is that correct ? If so, I can make a patch to handle races on that front. 
This code is pretty complicated, so please let me know if I have grossly mis-stated something (agents and agent_private, and whatnots :-). Thanks for your feedback, - KK On Mon, 1 Nov 2004, Sean Hefty wrote: > On Mon, 1 Nov 2004 16:38:03 -0800 (PST) > Krishna Kumar wrote: > > > Hi Sean, > > > > I think it is reasonable to have current senders racing with > > unregister. The unregister is waiting for all references to drop to > > zero before freeing up the resources. It killed the ones waiting for > > responses(mad_cancel), killed the ones who are executing in callback > > handlers, and finally after dropping the loader's module refcnt, it > > waits for the refcnt to drop to zero. These can only be threads which > > are actively receiving mad packets and those threads in the process of > > sending mad packets while the unregister was going on (and the ones > > which fail is the only cause of the problem). Essentially I think the > > unregister will hang and not free up the resource. > > The difference here is that a client is calling into the API at the same > time that they are trying to unregister. The code, even with this > change, cannot handle this condition. > > For example, if the thread calling ib_unregister_mad_agent executes > completely before the thread calling ib_post_send_mad runs (or can take > a reference on the mad_agent), the mad_agent is no longer valid, and the > structure will have been freed. The thread executing ib_post_send_mad > can crash the system at this point. > > If we want to allow a client to call ib_unregister_mad_agent and > ib_post_send_mad simultaneously, then ib_post_send_mad would need to > perform some sort of lookup (likely in some global map) to validate the > mad_agent. > > - Sean > > From krkumar at us.ibm.com Mon Nov 1 17:50:26 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 17:50:26 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: Message-ID: BTW, what I mean is that in the following code, put the atomic_inc() into the find() routine ... - KK ib_mad_recv_done_handler() { ... spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); if (!mad_agent) { spin_unlock_irqrestore(&port_priv->reg_lock, flags); printk(KERN_NOTICE PFX "No matching mad agent found for " "received MAD on port %d\n", port_priv->port_num); } else { atomic_inc(&mad_agent->refcount); spin_unlock_irqrestore(&port_priv->reg_lock, flags); ib_mad_complete_recv(mad_agent, recv, solicited); } ... } On Mon, 1 Nov 2004, Krishna Kumar wrote: > Hi Sean, > > I agree on the race between the threads, and this is something that I > had considered as a separate problem (but now it comes back to haunt > me :-). > > An easier solution for this problem is to make sure that whoever > gets the agent (ib_mad_recv_done_handler) validate the mad_agent > before calling us. Basically find_mad_agent can hold a refcnt > on the agent. Is that correct ? If so, I can make a patch to handle > races on that front. This code is pretty complicated, so please let > me know if I have grossly mis-stated something (agents and agent_private, > and whatnots :-). 
> > Thanks for your feedback, > > - KK > > On Mon, 1 Nov 2004, Sean Hefty wrote: > > > On Mon, 1 Nov 2004 16:38:03 -0800 (PST) > > Krishna Kumar wrote: > > > > > Hi Sean, > > > > > > I think it is reasonable to have current senders racing with > > > unregister. The unregister is waiting for all references to drop to > > > zero before freeing up the resources. It killed the ones waiting for > > > responses(mad_cancel), killed the ones who are executing in callback > > > handlers, and finally after dropping the loader's module refcnt, it > > > waits for the refcnt to drop to zero. These can only be threads which > > > are actively receiving mad packets and those threads in the process of > > > sending mad packets while the unregister was going on (and the ones > > > which fail is the only cause of the problem). Essentially I think the > > > unregister will hang and not free up the resource. > > > > The difference here is that a client is calling into the API at the same > > time that they are trying to unregister. The code, even with this > > change, cannot handle this condition. > > > > For example, if the thread calling ib_unregister_mad_agent executes > > completely before the thread calling ib_post_send_mad runs (or can take > > a reference on the mad_agent), the mad_agent is no longer valid, and the > > structure will have been freed. The thread executing ib_post_send_mad > > can crash the system at this point. > > > > If we want to allow a client to call ib_unregister_mad_agent and > > ib_post_send_mad simultaneously, then ib_post_send_mad would need to > > perform some sort of lookup (likely in some global map) to validate the > > mad_agent. > > > > - Sean > > > > > > From krkumar at us.ibm.com Mon Nov 1 18:02:54 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 18:02:54 -0800 (PST) Subject: [openib-general] [PATCH] Fix MMU if find_mad_agent() finds no agent. Message-ID: This fixes the above case and I also took the liberty of changing "goto ret" to "goto out", which just looks more aesthetic. I am not including inline, since my patches seem to get inlined automatically and without getting mangled. Hope this continues :-) Thanks, - KK -------------- next part -------------- diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-01 17:57:02.000000000 -0800 +++ 2/mad.c 2004-11-01 17:59:03.000000000 -0800 @@ -660,7 +660,7 @@ static void remove_mad_reg_req(struct ib /* Was MAD registration request supplied with original registration ? 
*/ if (!agent_priv->reg_req) { - goto ret; + goto out; } port_priv = agent_priv->port_priv; @@ -668,7 +668,7 @@ static void remove_mad_reg_req(struct ib if (!class) { printk(KERN_ERR PFX "No class table yet MAD registration " "request supplied\n"); - goto ret; + goto out; } mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); @@ -691,7 +691,7 @@ static void remove_mad_reg_req(struct ib } } -ret: +out: return; } @@ -753,7 +753,7 @@ find_mad_agent(struct ib_mad_port_privat if (!mad_agent) { printk(KERN_ERR PFX "No client 0x%x for received MAD " "on port %d\n", hi_tid, port_priv->port_num); - goto ret; + goto out; } } else { /* Routing is based on version, class, and method */ @@ -761,14 +761,14 @@ find_mad_agent(struct ib_mad_port_privat printk(KERN_ERR PFX "MAD received with unsupported " "class version %d on port %d\n", mad->mad_hdr.class_version, port_priv->port_num); - goto ret; + goto out; } version = port_priv->version[mad->mad_hdr.class_version]; if (!version) { printk(KERN_ERR PFX "MAD received on port %d for class " "version %d with no client\n", port_priv->port_num, mad->mad_hdr.class_version); - goto ret; + goto out; } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; @@ -776,18 +776,17 @@ find_mad_agent(struct ib_mad_port_privat printk(KERN_ERR PFX "MAD received on port %d for class " "%d with no client\n", port_priv->port_num, mad->mad_hdr.mgmt_class); - goto ret; + goto out; } mad_agent = class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; } -ret: - if (!mad_agent->agent.recv_handler) { +out: + if (mad_agent && !mad_agent->agent.recv_handler) { printk(KERN_ERR PFX "No receive handler for client " "%p on port %d\n", - &mad_agent->agent, - port_priv->port_num); + &mad_agent->agent, port_priv->port_num); mad_agent = NULL; } @@ -802,7 +801,7 @@ static int validate_mad(struct ib_mad *m if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { printk(KERN_ERR PFX "MAD received with unsupported base " "version %d\n", mad->mad_hdr.base_version); - goto ret; + goto out; } /* Filter SMI packets sent to other than QP0 */ @@ -816,7 +815,7 @@ static int validate_mad(struct ib_mad *m valid = 1; } -ret: +out: return valid; } @@ -978,7 +977,7 @@ static void ib_mad_recv_done_handler(str /* Validate MAD */ if (!validate_mad(recv->header.recv_buf.mad, qp_num)) - goto ret; + goto out; /* Snoop MAD ? */ if (port_priv->device->snoop_mad) @@ -986,7 +985,7 @@ static void ib_mad_recv_done_handler(str (u8)port_priv->port_num, wc->slid, recv->header.recv_buf.mad)) - goto ret; + goto out; spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ @@ -1003,7 +1002,7 @@ static void ib_mad_recv_done_handler(str ib_mad_complete_recv(mad_agent, recv, solicited); } -ret: +out: if (!mad_agent) { /* Should this case be optimized ? 
*/ kmem_cache_free(ib_mad_cache, recv); @@ -1255,7 +1254,7 @@ void ib_cancel_mad(struct ib_mad_agent * mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); if (!mad_send_wr) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - goto ret; + goto out; } if (mad_send_wr->status == IB_WC_SUCCESS) @@ -1264,7 +1263,7 @@ void ib_cancel_mad(struct ib_mad_agent * if (mad_send_wr->refcount != 0) { mad_send_wr->status = IB_WC_WR_FLUSH_ERR; spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - goto ret; + goto out; } list_del(&mad_send_wr->agent_list); @@ -1281,7 +1280,7 @@ void ib_cancel_mad(struct ib_mad_agent * if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); -ret: +out: return; } EXPORT_SYMBOL(ib_cancel_mad); From krkumar at us.ibm.com Mon Nov 1 18:10:16 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 18:10:16 -0800 (PST) Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area Message-ID: This is a minor cleanup/optimize in this area. No need to hold lock for too long (for atomic_inc), no multiple unlocks and normal case of finding mad_agent first. Thanks, - KK -------------- next part -------------- --- mad.c.org 2004-11-01 17:41:09.000000000 -0800 +++ mad.c 2004-11-01 17:43:55.000000000 -0800 @@ -988,20 +988,19 @@ static void ib_mad_recv_done_handler(str recv->header.recv_buf.mad)) goto out; - spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ + spin_lock_irqsave(&port_priv->reg_lock, flags); solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); - if (!mad_agent) { - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); - } else { + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + if (mad_agent) { atomic_inc(&mad_agent->refcount); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); ib_mad_complete_recv(mad_agent, recv, solicited); - } + } else + printk(KERN_NOTICE PFX "No matching mad agent found for " + "received MAD on port %d\n", port_priv->port_num); out: if (!mad_agent) { From halr at voltaire.com Mon Nov 1 18:52:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 21:52:53 -0500 Subject: [openib-general] mad: Validate *bad_send_wr in ib_post_send_mad() Message-ID: <1099363972.13695.2.camel@hpc-1> mad: Validate *bad_send_wr in ib_post_send_mad() Fix previous patch Index: mad.c =================================================================== --- mad.c (revision 1110) +++ mad.c (working copy) @@ -369,16 +369,15 @@ struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ - if (!mad_agent || !send_wr || !*bad_send_wr) { - *bad_send_wr = send_wr; - return -EINVAL; - } + if (!*bad_send_wr) + goto error1; + if (!mad_agent || !send_wr ) + goto error2; + if (!mad_agent->send_handler || - (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) { - *bad_send_wr = send_wr; - return -EINVAL; - } + (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) + goto error2; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); @@ -439,6 +438,11 @@ } return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return -EINVAL; } EXPORT_SYMBOL(ib_post_send_mad); From halr at voltaire.com Tue Nov 2 07:53:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 10:53:26 -0500 
Subject: [openib-general] [PATCH] mad: Fix previous patch again Message-ID: <1099410806.14725.5.camel@hpc-1> mad: Fix previous patch again (bad_send_wr validation in ib_post_send_mad) Index: mad.c =================================================================== --- mad.c (revision 1111) +++ mad.c (working copy) @@ -369,7 +369,7 @@ struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ - if (!*bad_send_wr) + if (!bad_send_wr) goto error1; if (!mad_agent || !send_wr ) From halr at voltaire.com Tue Nov 2 08:18:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:18:41 -0500 Subject: [openib-general] [PATCH] return the wrong bad_send_wr in ib_post_send_mad() In-Reply-To: <200411011537.02928.mashirle@us.ibm.com> References: <200411011537.02928.mashirle@us.ibm.com> Message-ID: <1099412321.3114.0.camel@hpc-1> On Mon, 2004-11-01 at 18:37, Shirley Ma wrote: > Here is the patch to return the correct bad_send_wr value after calling > ib_send_mad() in ib_post_send_mad(). Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 08:36:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:36:41 -0500 Subject: [openib-general] [PATCH] return the wrong bad_send_wr in ib_send_mad() In-Reply-To: <200411011547.47539.mashirle@us.ibm.com> References: <200411011537.02928.mashirle@us.ibm.com> <200411011547.47539.mashirle@us.ibm.com> Message-ID: <1099413401.3581.0.camel@hpc-1> On Mon, 2004-11-01 at 18:47, Shirley Ma wrote: > Another patch to fix wrong bad_send_wr in ib_send_mad() Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 08:49:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:49:52 -0500 Subject: [openib-general] [PATCH] Fix MMU if find_mad_agent() finds no agent. In-Reply-To: References: Message-ID: <1099414192.3581.8.camel@hpc-1> On Mon, 2004-11-01 at 21:02, Krishna Kumar wrote: > This fixes the above case and I also took the liberty of changing > "goto ret" to "goto out", which just looks more aesthetic. Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 08:59:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:59:18 -0500 Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area In-Reply-To: References: Message-ID: <1099414758.3581.15.camel@hpc-1> On Mon, 2004-11-01 at 21:10, Krishna Kumar wrote: > This is a minor cleanup/optimize in this area. No need to hold lock > for too long (for atomic_inc), no multiple unlocks and normal case > of finding mad_agent first. Doesn't this create a window between the unlocking after the mad_agent is found and the atomic_inc ? Couldn't a deregistration occur then ? -- Hal From halr at voltaire.com Tue Nov 2 09:11:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 12:11:46 -0500 Subject: [openib-general] ib_sa oops Message-ID: <1099415506.3581.27.camel@hpc-1> When I did a modprobe -r ib_ipoib, I got the following oops when the SA's send_handler is called while it's deregistering its MAD client with pending MADs.
I first bringup and configure IPoIB: /sbin/modprobe ib_ipoib /sbin/ifconfig ib0 192.168.0.20 I then do: ping -b 192.168.0.255 and ctl-C before it cycles around the list a second time and then: /sbin/modprobe -r ib_ipoib Segmentation fault /var/log/messages showed: Nov 2 10:54:17 hpc-1 kernel: Unable to handle kernel paging request at virtual address f8a50407 Nov 2 10:54:17 hpc-1 kernel: printing eip: Nov 2 10:54:17 hpc-1 kernel: f8a50407 Nov 2 10:54:17 hpc-1 kernel: *pde = 019a5067 Nov 2 10:54:17 hpc-1 kernel: *pte = 00000000 Nov 2 10:54:17 hpc-1 kernel: Oops: 0000 [#1] Nov 2 10:54:17 hpc-1 kernel: SMP Nov 2 10:54:17 hpc-1 kernel: Modules linked in: ib_sa ib_mad ib_services ib_mthca ib_core loop autofs e1000 ohci1394 ieee1394 parport_pc parport usbcore Nov 2 10:54:17 hpc-1 kernel: CPU: 0 Nov 2 10:54:17 hpc-1 kernel: EIP: 0060:[] Not tainted VLI Nov 2 10:54:17 hpc-1 kernel: EFLAGS: 00010246 (2.6.9) Nov 2 10:54:17 hpc-1 kernel: EIP is at 0xf8a50407 Nov 2 10:54:17 hpc-1 kernel: eax: e2f05280 ebx: 00000286 ecx: 00000000 edx: fffffffb Nov 2 10:54:17 hpc-1 kernel: esi: c6ba3340 edi: c6ba3348 ebp: fffffffb esp: e6eebdfc Nov 2 10:54:17 hpc-1 kernel: ds: 007b es: 007b ss: 0068 Nov 2 10:54:17 hpc-1 kernel: Process modprobe (pid: 12680, threadinfo=e6eea000 task=f5f30230) Nov 2 10:54:17 hpc-1 kernel: Stack: f8a217d8 fffffffb 00000000 e2f05280 e6eebe60 c02a1e5e 00000000 f5f30230 Nov 2 10:54:17 hpc-1 kernel: c0117d96 00000000 00000000 00000003 c170b060 c6ff3a70 c6ff3830 c011685a Nov 2 10:54:17 hpc-1 kernel: f5f30230 e74b5800 f5f30230 00000000 e6eebe98 c02a1a92 c6ff3830 c170e4d0 Nov 2 10:54:17 hpc-1 kernel: Call Trace: Nov 2 10:54:17 hpc-1 kernel: [] ib_sa_mcmember_rec_callback+0x5a/0x7f [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] wait_for_completion+0xc4/0xcc Nov 2 10:54:17 hpc-1 kernel: [] default_wake_function+0x0/0x12 Nov 2 10:54:17 hpc-1 kernel: [] finish_task_switch+0x3a/0x83 Nov 2 10:54:17 hpc-1 kernel: [] schedule+0x326/0x62e Nov 2 10:54:17 hpc-1 kernel: [] send_handler+0xaa/0xbc [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] cancel_mads+0xe5/0x127 [ib_mad] Nov 2 10:54:17 hpc-1 kernel: [] ib_unregister_mad_agent+0x16/0x135 [ib_mad] Nov 2 10:54:17 hpc-1 kernel: [] default_wake_function+0x0/0x12 Nov 2 10:54:17 hpc-1 kernel: [] default_wake_function+0x0/0x12 Nov 2 10:54:17 hpc-1 kernel: [] ib_get_client_data+0x42/0x4e [ib_core] Nov 2 10:54:17 hpc-1 kernel: [] ib_sa_remove_one+0x44/0x7d [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] ib_unregister_client+0xee/0xf3 [ib_core] Nov 2 10:54:17 hpc-1 kernel: [] try_stop_module+0x37/0x3b Nov 2 10:54:17 hpc-1 kernel: [] __try_stop_module+0x0/0x41 Nov 2 10:54:17 hpc-1 kernel: [] ib_sa_cleanup+0xf/0x13 [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] sys_delete_module+0x16d/0x19b Nov 2 10:54:17 hpc-1 kernel: [] sys_munmap+0x51/0x76 Nov 2 10:54:17 hpc-1 kernel: [] sysenter_past_esp+0x52/0x71 Nov 2 10:54:17 hpc-1 kernel: Code: Bad EIP value. -- Hal From roland at topspin.com Tue Nov 2 09:15:31 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 09:15:31 -0800 Subject: [openib-general] ib_sa oops In-Reply-To: <1099415506.3581.27.camel@hpc-1> (Hal Rosenstock's message of "Tue, 02 Nov 2004 12:11:46 -0500") References: <1099415506.3581.27.camel@hpc-1> Message-ID: <52wtx4rx3g.fsf@topspin.com> Hal> When I did a modprobe -r ib_ipoib, I got the following oops Hal> when the SA's send_handler is called on it's deregistering Hal> it's MAD client with pending MADs. Can you reproduce it with a kernel with CONFIG_KALLSYMS turned on so that I can read the oops? 
Thanks, Roland From roland at topspin.com Tue Nov 2 09:16:34 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 09:16:34 -0800 Subject: [openib-general] ib_sa oops In-Reply-To: <52wtx4rx3g.fsf@topspin.com> (Roland Dreier's message of "Tue, 02 Nov 2004 09:15:31 -0800") References: <1099415506.3581.27.camel@hpc-1> <52wtx4rx3g.fsf@topspin.com> Message-ID: <52sm7srx1p.fsf@topspin.com> Roland> Can you reproduce it with a kernel with CONFIG_KALLSYMS Roland> turned on so that I can read the oops? Sorry, never mind... the line wrapping was so bad that I didn't notice the function names. I'll take a look at what's happening. Thanks, Roland From halr at voltaire.com Tue Nov 2 09:26:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 12:26:53 -0500 Subject: [openib-general] ib_sa oops In-Reply-To: <52wtx4rx3g.fsf@topspin.com> References: <1099415506.3581.27.camel@hpc-1> <52wtx4rx3g.fsf@topspin.com> Message-ID: <1099416413.2985.0.camel@hpc-1> On Tue, 2004-11-02 at 12:15, Roland Dreier wrote: > Hal> When I did a modprobe -r ib_ipoib, I got the following oops > Hal> when the SA's send_handler is called on it's deregistering > Hal> it's MAD client with pending MADs. > > Can you reproduce it with a kernel with CONFIG_KALLSYMS turned on so > that I can read the oops? CONFIG_KALLSYMS is y. CONFIG_KALLSYMS_ALL is not set nor is CONFIG_KALLSYMS_EXTRA_PASS. Should either or both of them be set ? -- Hal From mshefty at ichips.intel.com Tue Nov 2 09:41:08 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:41:08 -0800 Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area In-Reply-To: <1099414758.3581.15.camel@hpc-1> References: <1099414758.3581.15.camel@hpc-1> Message-ID: <20041102094108.7f85249f.mshefty@ichips.intel.com> On Tue, 02 Nov 2004 11:59:18 -0500 Hal Rosenstock wrote: > On Mon, 2004-11-01 at 21:10, Krishna Kumar wrote: > > This is a minor cleanup/optimize in this area. No need to hold lock > > for too long (for atomic_inc), no multiple unlocks and normal case > > of finding mad_agent first. > > Doesn't this create a window between the unlocking after the mad_agent > is found and the atomic_inc ? Couldn't a deregistration occur then ? You are correct, Hal. We need to find and increment the mad_agent under the lock in order to prevent deregistration from completing. - Sean From roland at topspin.com Tue Nov 2 09:45:57 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 09:45:57 -0800 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <1099341735.3074.97.camel@hpc-1> (Hal Rosenstock's message of "Mon, 01 Nov 2004 15:42:16 -0500") References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> <1099341735.3074.97.camel@hpc-1> Message-ID: <52oeigrvoq.fsf@topspin.com> As far as I can tell this patch is broken: it removes the qp_cap parameter to modify_qp but doesn't fix up the mthca functions. I added the missing pieces by hand and applied. - R. 
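Returning to the find_mad_agent() race above, here is a minimal sketch of the receive-path pattern the thread converges on. It is illustrative only: it reuses the names from ib_mad_recv_done_handler (port_priv, recv, solicited, flags) and assumes, per the deregistration behavior described in this thread, that ib_unregister_mad_agent() may free the agent once its refcount is dropped and that ib_mad_complete_recv() releases the reference it is handed.

	spin_lock_irqsave(&port_priv->reg_lock, flags);
	mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad,
				   solicited);
	if (mad_agent)
		/* Pin the agent while reg_lock is still held, so a
		 * concurrent ib_unregister_mad_agent() cannot free it
		 * in the window between the unlock and the atomic_inc. */
		atomic_inc(&mad_agent->refcount);
	spin_unlock_irqrestore(&port_priv->reg_lock, flags);

	if (mad_agent)
		/* Assumed to drop the reference taken above when done. */
		ib_mad_complete_recv(mad_agent, recv, solicited);
	else
		printk(KERN_NOTICE PFX "No matching mad agent found for "
		       "received MAD on port %d\n", port_priv->port_num);

Taking the reference inside the locked region is what closes the window Hal points out: with the increment outside the lock, a deregistration could complete and free the agent before the receive path pins it.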
From mshefty at ichips.intel.com Tue Nov 2 09:50:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:50:51 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: <20041101165904.7a91f394.mshefty@ichips.intel.com> Message-ID: <20041102095051.39cdc978.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 17:40:56 -0800 (PST) Krishna Kumar wrote: > I agree on the race between the threads, and this is something that I > had considered as a separate problem (but now it comes back to haunt > me :-). This sort of race condition was something that I gave careful attention to when updating the code. That doesn't mean that I didn't miss something, and I appreciate that you're willing to review these for correctness. > An easier solution for this problem is to make sure that whoever > gets the agent (ib_mad_recv_done_handler) validate the mad_agent > before calling us. Basically find_mad_agent can hold a refcnt > on the agent. Is that correct ? This is correct. After find_mad_agent is called, the code takes a reference on the mad_agent. I think this is in the portion of code from one of your patches. Moving the reference inside find_mad_agent is a minor restructuring of the code. If we do move the reference, I think it makes sense to move the locking inside that routine as well. From krkumar at us.ibm.com Tue Nov 2 09:46:15 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 09:46:15 -0800 (PST) Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area In-Reply-To: <1099414758.3581.15.camel@hpc-1> Message-ID: Hi Hal, Yes, you are right. I was talking about this case with Sean, but forgot it when I actually sent the patch. Please disregard it. thx, - KK On Tue, 2 Nov 2004, Hal Rosenstock wrote: > On Mon, 2004-11-01 at 21:10, Krishna Kumar wrote: > > This is a minor cleanup/optimize in this area. No need to hold lock > > for too long (for atomic_inc), no multiple unlocks and normal case > > of finding mad_agent first. > > Doesn't this create a window between the unlocking after the mad_agent > is found and the atomic_inc ? Couldn't a deregistration occur then ? > > -- Hal > > > From mshefty at ichips.intel.com Tue Nov 2 09:52:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:52:39 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: Message-ID: <20041102095239.2b64f0b5.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 17:50:26 -0800 (PST) Krishna Kumar wrote: > spin_lock_irqsave(&port_priv->reg_lock, flags); > /* Determine corresponding MAD agent for incoming receive MAD > */ solicited = solicited_mad(recv->header.recv_buf.mad); > mad_agent = find_mad_agent(port_priv, > recv->header.recv_buf.mad, > solicited); > if (!mad_agent) { > spin_unlock_irqrestore(&port_priv->reg_lock, flags); > printk(KERN_NOTICE PFX "No matching mad agent found > for " > "received MAD on port %d\n", > port_priv->port_num); > } else { > atomic_inc(&mad_agent->refcount); > spin_unlock_irqrestore(&port_priv->reg_lock, flags); > ib_mad_complete_recv(mad_agent, recv, solicited); Related to this, the call to solicited_mad() doesn't need to be made while holding the lock. Moving this outside, we can push the locking inside find_mad_agent as well, if it makes more sense to do so.
- Sean From mshefty at ichips.intel.com Tue Nov 2 09:56:47 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:56:47 -0800 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041028233000.19879b59.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> Message-ID: <20041102095647.3b74fbc9.mshefty@ichips.intel.com> On Thu, 28 Oct 2004 23:30:00 -0700 Sean Hefty wrote: > Here's what I have to handle MAD completion handling. This patch > tries to fix the issue of matching a completion (successful or error) > with the corresponding work request. Some notes: As just an update, there were a couple of minor issues in this patch (minor to fix anyway...). I will post a new patch after merging in the latest changes to the code and retesting. - Sean From krkumar at us.ibm.com Tue Nov 2 09:59:14 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 09:59:14 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041102095051.39cdc978.mshefty@ichips.intel.com> Message-ID: Hi Sean, I think that is the best approach. And using this method, we can also avoid holding the lock if solicited is set. I will send a patch in a few minutes if this approach looks good. Thanks, - KK On Tue, 2 Nov 2004, Sean Hefty wrote: > Related to this, the call to solicited_mad() doesn't need to be made > while holding the lock. Moving this outside, we can push the locking > inside find_mad_agent as well, if it makes more sense to do so. From mshefty at ichips.intel.com Tue Nov 2 10:21:26 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 10:21:26 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: <20041102095051.39cdc978.mshefty@ichips.intel.com> Message-ID: <20041102102126.26746a63.mshefty@ichips.intel.com> On Tue, 2 Nov 2004 09:59:14 -0800 (PST) Krishna Kumar wrote: > Hi Sean, > > I think that is the best approach. And using this method, we can also > avoid holding the lock if solicited is set. I will send a patch in a > few minutes if this approach looks good. Sounds good. I think that you'll need to hold the lock even if solicited is set to handle the case where a response is received after the sender unregistered. - Sean From krkumar at us.ibm.com Tue Nov 2 10:40:04 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 10:40:04 -0800 (PST) Subject: [openib-general] [PATCH] Optimize check_class_table and method_table to return BOOL. Message-ID: The callers are just interested in knowing whether any methods or method tables are in use, not the actual use count. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-02 10:32:51.000000000 -0800 +++ 2/mad.c 2004-11-02 10:35:01.000000000 -0800 @@ -530,34 +530,30 @@ static int allocate_method_table(struct return 0; } +/* + * Check to see if there are any methods still in use. + */ static int check_method_table(struct ib_mad_mgmt_method_table *method) { - int i, j; + int i; - /* Check to see if there are any methods still in use */ - j = 0; - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) if (method->agent[i]) - j++; - } - return j; + return 1; + return 0; } +/* + * Check to see if there are any method tables for this class still in use.
+ */ static int check_class_table(struct ib_mad_mgmt_class_table *class) { - int i, j; + int i; - /* - * Check to see if there are any method tables for this class still - * in use - */ - j = 0; - for (i = 0; i < MAX_MGMT_CLASS; i++) { - if (class->method_table[i]) { - j++; - } - } - return j; + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; } static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, From halr at voltaire.com Tue Nov 2 10:57:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 13:57:59 -0500 Subject: [openib-general] [PATCH] Better IPoIB multicast handling In-Reply-To: <1099355147.9878.75.camel@duffman> References: <528y9yhb5o.fsf@topspin.com> <1098477556.1127.9.camel@duffman> <52wtxiea77.fsf@topspin.com> <1099355147.9878.75.camel@duffman> Message-ID: <1099421879.2834.3.camel@hpc-1> On Mon, 2004-11-01 at 19:25, Tom Duffy wrote: > I do get this warning now when I ifconfig up my device: > > ib0.8001: multicast group ff12401b8001000000000000ffffffff already attached > > But it seems harmless. I see it too and it appears to be benign. I see it along with a previous message: ib0: multicast join failed for ff12401bffff000000000000ffffffff, status -5 There are multiple Sets of MCMemberRecord with different component masks which are attempted for the broadcast group and the 224.0.0.1 group when the network interface is brought up with an IP address. -- Hal From halr at voltaire.com Tue Nov 2 11:15:40 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 14:15:40 -0500 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <52oeigrvoq.fsf@topspin.com> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> <1099341735.3074.97.camel@hpc-1> <52oeigrvoq.fsf@topspin.com> Message-ID: <1099422940.2838.2.camel@hpc-1> On Tue, 2004-11-02 at 12:45, Roland Dreier wrote: > As far as I can tell this patch is broken: it removes the qp_cap > parameter to modify_qp but doesn't fix up the mthca functions. Sorry :-( It looks like I somehow missed including either the modify_qp change or the change to mthca_qp.c. If it is the former, I wonder how things could build. Do you think this could be related to the oops ? (I will retest and see if I can recreate it). > I added the missing pieces by hand and applied. Thanks. -- Hal From mshefty at ichips.intel.com Tue Nov 2 11:19:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 11:19:06 -0800 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041028233000.19879b59.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> Message-ID: <20041102111906.431f78b0.mshefty@ichips.intel.com> On Thu, 28 Oct 2004 23:30:00 -0700 Sean Hefty wrote: > Here's what I have to handle MAD completion handling. This patch > tries to fix the issue of matching a completion (successful or error) > with the corresponding work request. Some notes: Please use this patch instead. I merged with the latest changes (as of this morning) and tested with opensm running on a remote node and ipoib running locally. This change is for the openib-candidate branch, but going forward, my intention is to create patches for the roland-merge branch.
- Sean Index: access/mad.c =================================================================== --- access/mad.c (revision 1116) +++ access/mad.c (working copy) @@ -81,9 +81,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp); -static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); @@ -130,6 +129,19 @@ 0 : mgmt_class; } +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -148,12 +160,13 @@ struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; - int ret2; + int ret2, qpn; unsigned long flags; u8 mgmt_class; /* Validate parameters */ - if (qp_type != IB_QPT_GSI && qp_type != IB_QPT_SMI) { + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { ret = ERR_PTR(-EINVAL); goto error1; } @@ -225,14 +238,14 @@ /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); - mad_agent_priv->port_priv = port_priv; + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; mad_agent_priv->agent.context = context; - mad_agent_priv->agent.qp = port_priv->qp[qp_type]; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; mad_agent_priv->agent.port_num = port_num; spin_lock_irqsave(&port_priv->reg_lock, flags); @@ -256,6 +269,7 @@ } } } + ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { ret = ERR_PTR(ret2); @@ -272,7 +286,6 @@ INIT_WORK(&mad_agent_priv->work, timeout_sends, mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); - mad_agent_priv->port_priv = port_priv; return &mad_agent_priv->agent; @@ -292,6 +305,7 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) { struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; unsigned long flags; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -305,13 +319,14 @@ */ cancel_mads(mad_agent_priv); + port_priv = mad_agent_priv->qp_info->port_priv; cancel_delayed_work(&mad_agent_priv->work); - flush_workqueue(mad_agent_priv->port_priv->wq); + flush_workqueue(port_priv->wq); - spin_lock_irqsave(&mad_agent_priv->port_priv->reg_lock, flags); + spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); list_del(&mad_agent_priv->agent_list); - spin_unlock_irqrestore(&mad_agent_priv->port_priv->reg_lock, flags); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* XXX: Cleanup pending RMPP receives for this agent */ @@ -326,30 +341,51 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); +static void queue_mad(struct ib_mad_queue *mad_queue, + struct ib_mad_list_head *mad_list) +{ + 
unsigned long flags; + + mad_list->mad_queue = mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_add_tail(&mad_list->list, &mad_queue->list); + mad_queue->count++; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { - struct ib_mad_port_private *port_priv; - unsigned long flags; + struct ib_mad_qp_info *qp_info; int ret; - port_priv = mad_agent_priv->port_priv; - /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; mad_send_wr->wr_id = send_wr->wr_id; - send_wr->wr_id = (unsigned long)mad_send_wr; + send_wr->wr_id = (unsigned long)&mad_send_wr->mad_list; + queue_mad(&qp_info->send_queue, &mad_send_wr->mad_list); - spin_lock_irqsave(&port_priv->send_list_lock, flags); ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); - if (!ret) { - list_add_tail(&mad_send_wr->send_list, - &port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count++; - } else + if (ret) { printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dequeue_mad(&mad_send_wr->mad_list); + *bad_send_wr = send_wr; + } return ret; } @@ -364,7 +400,6 @@ int ret; struct ib_send_wr *cur_send_wr, *next_send_wr; struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ if (!bad_send_wr) @@ -379,7 +414,6 @@ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - port_priv = mad_agent_priv->port_priv; /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; @@ -421,6 +455,7 @@ cur_send_wr, bad_send_wr); if (ret) { /* Handle QP overrun separately... -ENOMEM */ + /* Handle posting when QP is in error state... 
*/ /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); @@ -587,7 +622,7 @@ if (!mad_reg_req) return 0; - private = priv->port_priv; + private = priv->qp_info->port_priv; mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); class = &private->version[mad_reg_req->mgmt_class_version]; if (!*class) { @@ -663,7 +698,7 @@ goto out; } - port_priv = agent_priv->port_priv; + port_priv = agent_priv->qp_info->port_priv; class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; if (!class) { printk(KERN_ERR PFX "No class table yet MAD registration " @@ -695,20 +730,6 @@ return; } -static int convert_qpnum(u32 qp_num) -{ - /* - * XXX: No redirection currently - * QP0 and QP1 only - * Ultimately, will need table of QP numbers and table index - * as QP numbers will not be packed once redirection supported - */ - if (qp_num > 1) { - return -1; - } - return qp_num; -} - static int response_mad(struct ib_mad *mad) { /* Trap represses are responses although response bit is reset */ @@ -913,55 +934,21 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { + struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; - union ib_mad_recv_wrid wrid; - unsigned long flags; - u32 qp_num; + struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent = NULL; - int solicited, qpn; - - /* For receive, QP number is field in the WC WRID */ - wrid.wrid = wc->wr_id; - qp_num = wrid.wrid_field.qpn; - qpn = convert_qpnum(qp_num); - if (qpn == -1) { - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Packet received on unknown QPN %d\n", - qp_num); - return; - } - - /* - * Completion corresponds to first entry on - * posted MAD receive list based on WRID in completion - */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - if (!list_empty(&port_priv->recv_posted_mad_list[qpn])) { - rbuf = list_entry(port_priv->recv_posted_mad_list[qpn].next, - struct ib_mad_recv_buf, - list); - mad_priv_hdr = container_of(rbuf, struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, struct ib_mad_private, - header); - - /* Remove from posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - - } else { - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Receive completion WR ID 0x%Lx on QP %d " - "with no posted receive\n", - (unsigned long long) wc->wr_id, - qp_num); - return; - } - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + int solicited; + unsigned long flags; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); pci_unmap_single(port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - @@ -976,7 +963,7 @@ recv->header.recv_buf.grh = &recv->grh; /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_num)) + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; /* Snoop MAD ? 
*/ @@ -1009,7 +996,7 @@ } /* Post another receive request for this QP */ - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); + ib_mad_post_receive_mad(qp_info); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1030,7 +1017,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1060,7 +1048,7 @@ /* Reschedule a work item if we have a shorter timeout */ if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { cancel_delayed_work(&mad_agent_priv->work); - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1114,39 +1102,15 @@ struct ib_wc *wc) { struct ib_mad_send_wr_private *mad_send_wr; - unsigned long flags; - - /* Completion corresponds to first entry on posted MAD send list */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (list_empty(&port_priv->send_posted_mad_list)) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx but send " - "list is empty\n", (unsigned long long) wc->wr_id); - goto error; - } - - mad_send_wr = list_entry(port_priv->send_posted_mad_list.next, - struct ib_mad_send_wr_private, - send_list); - if (wc->wr_id != (unsigned long)mad_send_wr) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx doesn't match " - "posted send WR ID 0x%lx\n", - (unsigned long long) wc->wr_id, - (unsigned long)mad_send_wr); - goto error; - } - - /* Remove from posted send MAD list */ - list_del(&mad_send_wr->send_list); - port_priv->send_posted_mad_count--; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + struct ib_mad_list_head *mad_list; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + dequeue_mad(mad_list); /* Restore client wr_id in WC */ wc->wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); - return; - -error: - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); } /* @@ -1156,28 +1120,33 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { if (wc.status != IB_WC_SUCCESS) { - printk(KERN_ERR PFX "Completion error %d WRID 0x%Lx\n", - wc.status, (unsigned long long) wc.wr_id); + /* Determine if failure was a send or receive. 
*/ + mad_list = (struct ib_mad_list_head *) + (unsigned long)wc.wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->send_queue) + wc.opcode = IB_WC_SEND; + else + wc.opcode = IB_WC_RECV; + } + switch (wc.opcode) { + case IB_WC_SEND: ib_mad_send_done_handler(port_priv, &wc); - } else { - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - printk(KERN_ERR PFX "Wrong Opcode 0x%x on completion\n", - wc.opcode); - break; - } + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; } } } @@ -1307,7 +1276,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); break; } @@ -1332,24 +1302,13 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; - unsigned long flags; int ret; - union ib_mad_recv_wrid wrid; - int qpn; - - - qpn = convert_qpnum(qp->qp_num); - if (qpn == -1) { - printk(KERN_ERR PFX "Post receive to invalid QPN %d\n", qp->qp_num); - return -EINVAL; - } /* * Allocate memory for receive buffer. @@ -1367,47 +1326,32 @@ } /* Setup scatter list */ - sg_list.addr = pci_map_single(port_priv->device->dma_device, + sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; - sg_list.lkey = (*port_priv->mr).lkey; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; /* Setup receive WR */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - wrid.wrid_field.index = port_priv->recv_wr_index[qpn]++; - wrid.wrid_field.qpn = qp->qp_num; - recv_wr.wr_id = wrid.wrid; - - /* Link receive WR into posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_add_tail(&mad_priv->header.recv_buf.list, - &port_priv->recv_posted_mad_list[qpn]); - port_priv->recv_posted_mad_count[qpn]++; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); - /* Now, post receive WR */ - ret = ib_post_recv(qp, &recv_wr, &bad_recv_wr); + /* Post receive WR. 
*/ + queue_mad(&qp_info->recv_queue, &mad_priv->header.mad_list); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - - pci_unmap_single(port_priv->device->dma_device, + dequeue_mad(&mad_priv->header.mad_list); + pci_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); - /* Unlink from posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_del(&mad_priv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx failed ret = %d\n", (unsigned long long) recv_wr.wr_id, ret); @@ -1420,79 +1364,72 @@ /* * Allocate receive MADs and post receive WRs for them */ -static int ib_mad_post_receive_mads(struct ib_mad_port_private *port_priv) +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) { - int i, j; + int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - for (j = 0; j < IB_MAD_QPS_CORE; j++) { - if (ib_mad_post_receive_mad(port_priv, - port_priv->qp[j])) { - printk(KERN_ERR PFX "receive post %d failed " - "on %s port %d\n", i + 1, - port_priv->device->name, - port_priv->port_num); - } + ret = ib_mad_post_receive_mad(qp_info); + if (ret) { + printk(KERN_ERR PFX "receive post %d failed " + "on %s port %d\n", i + 1, + qp_info->port_priv->device->name, + qp_info->port_priv->port_num); + break; } } - - return 0; + return ret; } /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) { - int i; unsigned long flags; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - while (!list_empty(&port_priv->recv_posted_mad_list[i])) { + spin_lock_irqsave(&qp_info->recv_queue.lock, flags); + while (!list_empty(&qp_info->recv_queue.list)) { - rbuf = list_entry(port_priv->recv_posted_mad_list[i].next, - struct ib_mad_recv_buf, list); - mad_priv_hdr = container_of(rbuf, - struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, - struct ib_mad_private, header); + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); - /* Remove for posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - - /* Undo PCI mapping */ - pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); - - kmem_cache_free(ib_mad_cache, recv); - } + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); - port_priv->recv_posted_mad_count[i] = 0; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + /* Undo PCI mapping */ + pci_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + PCI_DMA_FROMDEVICE); + kmem_cache_free(ib_mad_cache, recv); } + 
+ qp_info->recv_queue.count = 0; + spin_unlock_irqrestore(&qp_info->recv_queue.lock, flags); } /* * Return all the posted send MADs */ -static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) { unsigned long flags; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Just clear port send posted MAD list */ - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count = 0; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + /* Just clear port send posted MAD list... revisit!!! */ + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + INIT_LIST_HEAD(&qp_info->send_queue.list); + qp_info->send_queue.count = 0; + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); } /* @@ -1618,35 +1555,21 @@ int ret, i, ret2; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "INIT\n", i); - return ret; + goto error; } - } - - ret = ib_mad_post_receive_mads(port_priv); - if (ret) { - printk(KERN_ERR PFX "Couldn't post receive requests\n"); - goto error; - } - - ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); - if (ret) { - printk(KERN_ERR PFX "Failed to request completion notification\n"); - goto error; - } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTR\n", i); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1654,17 +1577,31 @@ } } + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion notification\n"); + goto error; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive requests\n"); + goto error; + } + } return 0; + error: - ib_mad_return_posted_recv_mads(port_priv); for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret2 = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ret2 = ib_mad_change_qp_state_to_reset(port_priv-> + qp_info[i].qp); if (ret2) { printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " "change QP%d state to RESET\n", i); } } - return ret; } @@ -1676,16 +1613,64 @@ int i, ret; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_reset(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change " "%s port %d QP%d state to RESET\n", port_priv->device->name, port_priv->port_num, i); } + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); } +} - ib_mad_return_posted_recv_mads(port_priv); - ib_mad_return_posted_send_mads(port_priv); +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static int create_mad_qp(struct ib_mad_port_private 
*port_priv, + struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = port_priv->cq; + qp_init_attr.recv_cq = port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = port_priv->port_num; + qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); } /* @@ -1694,7 +1679,7 @@ */ static int ib_mad_port_open(struct ib_device *device, int port_num) { - int ret, cq_size, i; + int ret, cq_size; u64 iova = 0; struct ib_phys_buf buf_list = { .addr = 0, @@ -1749,38 +1734,15 @@ goto error5; } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - struct ib_qp_init_attr qp_init_attr; - - memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; - qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; - qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; - qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; - qp_init_attr.qp_type = i; /* Relies on ib_qp_type enum ordering of IB_QPT_SMI and IB_QPT_GSI */ - qp_init_attr.port_num = port_priv->port_num; - port_priv->qp[i] = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(port_priv->qp[i])) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", i); - ret = PTR_ERR(port_priv->qp[i]); - if (i == 0) - goto error6; - else - goto error7; - } - } + ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; spin_lock_init(&port_priv->reg_lock); - spin_lock_init(&port_priv->recv_list_lock); - spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->agent_list); - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - for (i = 0; i < IB_MAD_QPS_CORE; i++) - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->wq = create_workqueue("ib_mad"); if (!port_priv->wq) { @@ -1798,15 +1760,14 @@ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_mad_port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - return 0; error9: destroy_workqueue(port_priv->wq); error8: - ib_destroy_qp(port_priv->qp[1]); + destroy_mad_qp(&port_priv->qp_info[1]); error7: - ib_destroy_qp(port_priv->qp[0]); + destroy_mad_qp(&port_priv->qp_info[0]); error6: ib_dereg_mr(port_priv->mr); error5: @@ -1842,8 +1803,8 @@ ib_mad_port_stop(port_priv); flush_workqueue(port_priv->wq); destroy_workqueue(port_priv->wq); - ib_destroy_qp(port_priv->qp[1]); - ib_destroy_qp(port_priv->qp[0]); + 
destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); ib_destroy_cq(port_priv->cq); Index: access/mad_priv.h =================================================================== --- access/mad_priv.h (revision 1116) +++ access/mad_priv.h (working copy) @@ -79,16 +79,13 @@ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 - -union ib_mad_recv_wrid { - u64 wrid; - struct { - u32 index; - u32 qpn; - } wrid_field; +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; }; struct ib_mad_private_header { + struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; struct ib_mad_recv_buf recv_buf; DECLARE_PCI_UNMAP_ADDR(mapping) @@ -108,7 +105,7 @@ struct list_head agent_list; struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; - struct ib_mad_port_private *port_priv; + struct ib_mad_qp_info *qp_info; spinlock_t lock; struct list_head send_list; @@ -122,7 +119,7 @@ }; struct ib_mad_send_wr_private { - struct list_head send_list; + struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent *agent; u64 wr_id; /* client WR ID */ @@ -140,11 +137,25 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + /* struct ib_mad_queue overflow_queue; */ +}; + struct ib_mad_port_private { struct list_head port_list; struct ib_device *device; int port_num; - struct ib_qp *qp[IB_MAD_QPS_CORE]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; @@ -154,15 +165,7 @@ struct list_head agent_list; struct workqueue_struct *wq; struct work_struct work; - - spinlock_t send_list_lock; - struct list_head send_posted_mad_list; - int send_posted_mad_count; - - spinlock_t recv_list_lock; - struct list_head recv_posted_mad_list[IB_MAD_QPS_CORE]; - int recv_posted_mad_count[IB_MAD_QPS_CORE]; - u32 recv_wr_index[IB_MAD_QPS_CORE]; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; #endif /* __IB_MAD_PRIV_H__ */ From krkumar at us.ibm.com Tue Nov 2 11:17:25 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 11:17:25 -0800 (PST) Subject: [openib-general] [PATCH] General cleanup in add_mad_reg_req Message-ID: Optimize a "clear" operation to use memset, and re-arrange the code a bit to make the function cleaner. (Hal, I am sending patches based on latest bits, but not on top of my earlier sent-but-not-applied patches. If it doesn't apply, please let me know and I will recreate the patch). Thanks. 
- KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-02 10:46:19.000000000 -0800 +++ 2/mad.c 2004-11-02 10:58:50.000000000 -0800 @@ -596,31 +596,28 @@ static int add_mad_reg_req(struct ib_mad if (!*class) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; goto error1; } /* Clear management class table for this class version */ - for (i = 0; i < MAX_MGMT_CLASS; i++) { - (*class)->method_table[i] = NULL; - } + memset((*class)->method_table, 0, + sizeof((*class)->method_table)); /* Allocate method table for this management class */ method = &(*class)->method_table[mgmt_class]; - if (allocate_method_table(method)) { + if ((ret = allocate_method_table(method))) goto error2; - } } else { method = &(*class)->method_table[mgmt_class]; if (!*method) { /* Allocate method table for this management class */ - if (allocate_method_table(method)) { + if ((ret = allocate_method_table(method))) goto error1; - } } } /* Now, make sure methods are not already in use */ - if (method_in_use(method, mad_reg_req)) { + if (method_in_use(method, mad_reg_req)) goto error3; - } /* Finally, add in methods being registered */ for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); @@ -641,13 +638,11 @@ error3: *method = NULL; } ret = -EINVAL; - goto error; + goto error1; error2: kfree(*class); *class = NULL; error1: - ret = -ENOMEM; -error: return ret; } From halr at voltaire.com Tue Nov 2 11:34:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 14:34:25 -0500 Subject: [openib-general] [PATCH] Optimize check_class_table and method_table to return BOOL. In-Reply-To: References: Message-ID: <1099424065.4129.0.camel@hpc-1> On Tue, 2004-11-02 at 13:40, Krishna Kumar wrote: > The callers are just interested in knowing whether any methods or method > tables are in use, not the actual use count. Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 11:45:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 14:45:03 -0500 Subject: [openib-general] ifconfig ib0 down and then up vis a vis IP connectivity Message-ID: <1099424703.4129.9.camel@hpc-1> Hi, What is the ARP timeout in Linux ? If I down and then up the ib0 interface, there is some delay before connectivity is restored despite the fact that it is successfully (re)attached to the multicast groups and that all the QPNs seem to be the same. After some time period, connectivity is restored. Any idea on what is different ? It seems like it is an ARP cache issue. Thanks. -- Hal From halr at voltaire.com Tue Nov 2 12:00:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 15:00:04 -0500 Subject: [openib-general] [PATCH] General cleanup in add_mad_reg_req In-Reply-To: References: Message-ID: <1099425604.4129.26.camel@hpc-1> On Tue, 2004-11-02 at 14:17, Krishna Kumar wrote: > Optimize a "clear" operation to use memset, and re-arrange the code a bit > to make the function cleaner. Thanks. Applied. 
-- Hal From halr at voltaire.com Tue Nov 2 12:05:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 15:05:06 -0500 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041102111906.431f78b0.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> Message-ID: <1099425905.4129.29.camel@hpc-1> On Tue, 2004-11-02 at 14:19, Sean Hefty wrote: > Index: access/mad.c > =================================================================== > --- access/mad.c (revision 1116) > +++ access/mad.c (working copy) > @@ -81,9 +81,8 @@ > static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, > struct ib_mad_agent_private *priv); > static void remove_mad_reg_req(struct ib_mad_agent_private *priv); > -static int ib_mad_post_receive_mad(struct ib_mad_port_private > *port_priv, I get an error here: patching file mad.c patch: **** malformed patch at line 10: *port_priv, I think the mail somehow made *port_priv a separate line. -- Hal From mshefty at ichips.intel.com Tue Nov 2 12:02:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 12:02:44 -0800 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <1099425905.4129.29.camel@hpc-1> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> <1099425905.4129.29.camel@hpc-1> Message-ID: <20041102120244.498d75f2.mshefty@ichips.intel.com> On Tue, 02 Nov 2004 15:05:06 -0500 Hal Rosenstock wrote: > I get an error here: > patching file mad.c > patch: **** malformed patch at line 10: *port_priv, I think my mailer wrapped the lines after I hit send. Let me try again. - Sean Index: access/mad.c =================================================================== --- access/mad.c (revision 1116) +++ access/mad.c (working copy) @@ -81,9 +81,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp); -static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); @@ -130,6 +129,19 @@ 0 : mgmt_class; } +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -148,12 +160,13 @@ struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; - int ret2; + int ret2, qpn; unsigned long flags; u8 mgmt_class; /* Validate parameters */ - if (qp_type != IB_QPT_GSI && qp_type != IB_QPT_SMI) { + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { ret = ERR_PTR(-EINVAL); goto error1; } @@ -225,14 +238,14 @@ /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); - mad_agent_priv->port_priv = port_priv; + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = 
rmpp_version; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; mad_agent_priv->agent.context = context; - mad_agent_priv->agent.qp = port_priv->qp[qp_type]; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; mad_agent_priv->agent.port_num = port_num; spin_lock_irqsave(&port_priv->reg_lock, flags); @@ -256,6 +269,7 @@ } } } + ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { ret = ERR_PTR(ret2); @@ -272,7 +286,6 @@ INIT_WORK(&mad_agent_priv->work, timeout_sends, mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); - mad_agent_priv->port_priv = port_priv; return &mad_agent_priv->agent; @@ -292,6 +305,7 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) { struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; unsigned long flags; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -305,13 +319,14 @@ */ cancel_mads(mad_agent_priv); + port_priv = mad_agent_priv->qp_info->port_priv; cancel_delayed_work(&mad_agent_priv->work); - flush_workqueue(mad_agent_priv->port_priv->wq); + flush_workqueue(port_priv->wq); - spin_lock_irqsave(&mad_agent_priv->port_priv->reg_lock, flags); + spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); list_del(&mad_agent_priv->agent_list); - spin_unlock_irqrestore(&mad_agent_priv->port_priv->reg_lock, flags); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* XXX: Cleanup pending RMPP receives for this agent */ @@ -326,30 +341,51 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); +static void queue_mad(struct ib_mad_queue *mad_queue, + struct ib_mad_list_head *mad_list) +{ + unsigned long flags; + + mad_list->mad_queue = mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_add_tail(&mad_list->list, &mad_queue->list); + mad_queue->count++; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { - struct ib_mad_port_private *port_priv; - unsigned long flags; + struct ib_mad_qp_info *qp_info; int ret; - port_priv = mad_agent_priv->port_priv; - /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; mad_send_wr->wr_id = send_wr->wr_id; - send_wr->wr_id = (unsigned long)mad_send_wr; + send_wr->wr_id = (unsigned long)&mad_send_wr->mad_list; + queue_mad(&qp_info->send_queue, &mad_send_wr->mad_list); - spin_lock_irqsave(&port_priv->send_list_lock, flags); ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); - if (!ret) { - list_add_tail(&mad_send_wr->send_list, - &port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count++; - } else + if (ret) { printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dequeue_mad(&mad_send_wr->mad_list); + *bad_send_wr = send_wr; + } return ret; } @@ -364,7 +400,6 @@ int ret; struct ib_send_wr *cur_send_wr, *next_send_wr; struct 
ib_mad_agent_private *mad_agent_priv; - struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ if (!bad_send_wr) @@ -379,7 +414,6 @@ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - port_priv = mad_agent_priv->port_priv; /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; @@ -421,6 +455,7 @@ cur_send_wr, bad_send_wr); if (ret) { /* Handle QP overrun separately... -ENOMEM */ + /* Handle posting when QP is in error state... */ /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); @@ -587,7 +622,7 @@ if (!mad_reg_req) return 0; - private = priv->port_priv; + private = priv->qp_info->port_priv; mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); class = &private->version[mad_reg_req->mgmt_class_version]; if (!*class) { @@ -663,7 +698,7 @@ goto out; } - port_priv = agent_priv->port_priv; + port_priv = agent_priv->qp_info->port_priv; class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; if (!class) { printk(KERN_ERR PFX "No class table yet MAD registration " @@ -695,20 +730,6 @@ return; } -static int convert_qpnum(u32 qp_num) -{ - /* - * XXX: No redirection currently - * QP0 and QP1 only - * Ultimately, will need table of QP numbers and table index - * as QP numbers will not be packed once redirection supported - */ - if (qp_num > 1) { - return -1; - } - return qp_num; -} - static int response_mad(struct ib_mad *mad) { /* Trap represses are responses although response bit is reset */ @@ -913,55 +934,21 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { + struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; - union ib_mad_recv_wrid wrid; - unsigned long flags; - u32 qp_num; + struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent = NULL; - int solicited, qpn; - - /* For receive, QP number is field in the WC WRID */ - wrid.wrid = wc->wr_id; - qp_num = wrid.wrid_field.qpn; - qpn = convert_qpnum(qp_num); - if (qpn == -1) { - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Packet received on unknown QPN %d\n", - qp_num); - return; - } - - /* - * Completion corresponds to first entry on - * posted MAD receive list based on WRID in completion - */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - if (!list_empty(&port_priv->recv_posted_mad_list[qpn])) { - rbuf = list_entry(port_priv->recv_posted_mad_list[qpn].next, - struct ib_mad_recv_buf, - list); - mad_priv_hdr = container_of(rbuf, struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, struct ib_mad_private, - header); - - /* Remove from posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - - } else { - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Receive completion WR ID 0x%Lx on QP %d " - "with no posted receive\n", - (unsigned long long) wc->wr_id, - qp_num); - return; - } - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + int solicited; + unsigned long flags; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); 
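+ /* wr_id carries a pointer to the embedded ib_mad_list_head; the two
+ * container_of() steps above recover the enclosing ib_mad_private, so
+ * no posted-receive-list search is needed to match this completion. */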
pci_unmap_single(port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - @@ -976,7 +963,7 @@ recv->header.recv_buf.grh = &recv->grh; /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_num)) + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; /* Snoop MAD ? */ @@ -1009,7 +996,7 @@ } /* Post another receive request for this QP */ - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); + ib_mad_post_receive_mad(qp_info); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1030,7 +1017,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1060,7 +1048,7 @@ /* Reschedule a work item if we have a shorter timeout */ if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { cancel_delayed_work(&mad_agent_priv->work); - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1114,39 +1102,15 @@ struct ib_wc *wc) { struct ib_mad_send_wr_private *mad_send_wr; - unsigned long flags; - - /* Completion corresponds to first entry on posted MAD send list */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (list_empty(&port_priv->send_posted_mad_list)) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx but send " - "list is empty\n", (unsigned long long) wc->wr_id); - goto error; - } - - mad_send_wr = list_entry(port_priv->send_posted_mad_list.next, - struct ib_mad_send_wr_private, - send_list); - if (wc->wr_id != (unsigned long)mad_send_wr) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx doesn't match " - "posted send WR ID 0x%lx\n", - (unsigned long long) wc->wr_id, - (unsigned long)mad_send_wr); - goto error; - } - - /* Remove from posted send MAD list */ - list_del(&mad_send_wr->send_list); - port_priv->send_posted_mad_count--; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + struct ib_mad_list_head *mad_list; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + dequeue_mad(mad_list); /* Restore client wr_id in WC */ wc->wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); - return; - -error: - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); } /* @@ -1156,28 +1120,33 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { if (wc.status != IB_WC_SUCCESS) { - printk(KERN_ERR PFX "Completion error %d WRID 0x%Lx\n", - wc.status, (unsigned long long) wc.wr_id); + /* Determine if failure was a send or receive. 
*/ + mad_list = (struct ib_mad_list_head *) + (unsigned long)wc.wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->send_queue) + wc.opcode = IB_WC_SEND; + else + wc.opcode = IB_WC_RECV; + } + switch (wc.opcode) { + case IB_WC_SEND: ib_mad_send_done_handler(port_priv, &wc); - } else { - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - printk(KERN_ERR PFX "Wrong Opcode 0x%x on completion\n", - wc.opcode); - break; - } + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; } } } @@ -1307,7 +1276,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); break; } @@ -1332,24 +1302,13 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; - unsigned long flags; int ret; - union ib_mad_recv_wrid wrid; - int qpn; - - - qpn = convert_qpnum(qp->qp_num); - if (qpn == -1) { - printk(KERN_ERR PFX "Post receive to invalid QPN %d\n", qp->qp_num); - return -EINVAL; - } /* * Allocate memory for receive buffer. @@ -1367,47 +1326,32 @@ } /* Setup scatter list */ - sg_list.addr = pci_map_single(port_priv->device->dma_device, + sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; - sg_list.lkey = (*port_priv->mr).lkey; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; /* Setup receive WR */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - wrid.wrid_field.index = port_priv->recv_wr_index[qpn]++; - wrid.wrid_field.qpn = qp->qp_num; - recv_wr.wr_id = wrid.wrid; - - /* Link receive WR into posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_add_tail(&mad_priv->header.recv_buf.list, - &port_priv->recv_posted_mad_list[qpn]); - port_priv->recv_posted_mad_count[qpn]++; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); - /* Now, post receive WR */ - ret = ib_post_recv(qp, &recv_wr, &bad_recv_wr); + /* Post receive WR. 
*/ + queue_mad(&qp_info->recv_queue, &mad_priv->header.mad_list); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - - pci_unmap_single(port_priv->device->dma_device, + dequeue_mad(&mad_priv->header.mad_list); + pci_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); - /* Unlink from posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_del(&mad_priv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx failed ret = %d\n", (unsigned long long) recv_wr.wr_id, ret); @@ -1420,79 +1364,72 @@ /* * Allocate receive MADs and post receive WRs for them */ -static int ib_mad_post_receive_mads(struct ib_mad_port_private *port_priv) +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) { - int i, j; + int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - for (j = 0; j < IB_MAD_QPS_CORE; j++) { - if (ib_mad_post_receive_mad(port_priv, - port_priv->qp[j])) { - printk(KERN_ERR PFX "receive post %d failed " - "on %s port %d\n", i + 1, - port_priv->device->name, - port_priv->port_num); - } + ret = ib_mad_post_receive_mad(qp_info); + if (ret) { + printk(KERN_ERR PFX "receive post %d failed " + "on %s port %d\n", i + 1, + qp_info->port_priv->device->name, + qp_info->port_priv->port_num); + break; } } - - return 0; + return ret; } /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) { - int i; unsigned long flags; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - while (!list_empty(&port_priv->recv_posted_mad_list[i])) { + spin_lock_irqsave(&qp_info->recv_queue.lock, flags); + while (!list_empty(&qp_info->recv_queue.list)) { - rbuf = list_entry(port_priv->recv_posted_mad_list[i].next, - struct ib_mad_recv_buf, list); - mad_priv_hdr = container_of(rbuf, - struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, - struct ib_mad_private, header); + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); - /* Remove for posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - - /* Undo PCI mapping */ - pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); - - kmem_cache_free(ib_mad_cache, recv); - } + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); - port_priv->recv_posted_mad_count[i] = 0; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + /* Undo PCI mapping */ + pci_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + PCI_DMA_FROMDEVICE); + kmem_cache_free(ib_mad_cache, recv); } + 
+ qp_info->recv_queue.count = 0; + spin_unlock_irqrestore(&qp_info->recv_queue.lock, flags); } /* * Return all the posted send MADs */ -static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) { unsigned long flags; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Just clear port send posted MAD list */ - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count = 0; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + /* Just clear port send posted MAD list... revisit!!! */ + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + INIT_LIST_HEAD(&qp_info->send_queue.list); + qp_info->send_queue.count = 0; + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); } /* @@ -1618,35 +1555,21 @@ int ret, i, ret2; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "INIT\n", i); - return ret; + goto error; } - } - - ret = ib_mad_post_receive_mads(port_priv); - if (ret) { - printk(KERN_ERR PFX "Couldn't post receive requests\n"); - goto error; - } - - ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); - if (ret) { - printk(KERN_ERR PFX "Failed to request completion notification\n"); - goto error; - } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTR\n", i); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1654,17 +1577,31 @@ } } + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion notification\n"); + goto error; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive requests\n"); + goto error; + } + } return 0; + error: - ib_mad_return_posted_recv_mads(port_priv); for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret2 = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ret2 = ib_mad_change_qp_state_to_reset(port_priv-> + qp_info[i].qp); if (ret2) { printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " "change QP%d state to RESET\n", i); } } - return ret; } @@ -1676,16 +1613,64 @@ int i, ret; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_reset(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change " "%s port %d QP%d state to RESET\n", port_priv->device->name, port_priv->port_num, i); } + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); } +} - ib_mad_return_posted_recv_mads(port_priv); - ib_mad_return_posted_send_mads(port_priv); +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static int create_mad_qp(struct ib_mad_port_private 
*port_priv, + struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = port_priv->cq; + qp_init_attr.recv_cq = port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = port_priv->port_num; + qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); } /* @@ -1694,7 +1679,7 @@ */ static int ib_mad_port_open(struct ib_device *device, int port_num) { - int ret, cq_size, i; + int ret, cq_size; u64 iova = 0; struct ib_phys_buf buf_list = { .addr = 0, @@ -1749,38 +1734,15 @@ goto error5; } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - struct ib_qp_init_attr qp_init_attr; - - memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; - qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; - qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; - qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; - qp_init_attr.qp_type = i; /* Relies on ib_qp_type enum ordering of IB_QPT_SMI and IB_QPT_GSI */ - qp_init_attr.port_num = port_priv->port_num; - port_priv->qp[i] = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(port_priv->qp[i])) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", i); - ret = PTR_ERR(port_priv->qp[i]); - if (i == 0) - goto error6; - else - goto error7; - } - } + ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; spin_lock_init(&port_priv->reg_lock); - spin_lock_init(&port_priv->recv_list_lock); - spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->agent_list); - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - for (i = 0; i < IB_MAD_QPS_CORE; i++) - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->wq = create_workqueue("ib_mad"); if (!port_priv->wq) { @@ -1798,15 +1760,14 @@ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_mad_port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - return 0; error9: destroy_workqueue(port_priv->wq); error8: - ib_destroy_qp(port_priv->qp[1]); + destroy_mad_qp(&port_priv->qp_info[1]); error7: - ib_destroy_qp(port_priv->qp[0]); + destroy_mad_qp(&port_priv->qp_info[0]); error6: ib_dereg_mr(port_priv->mr); error5: @@ -1842,8 +1803,8 @@ ib_mad_port_stop(port_priv); flush_workqueue(port_priv->wq); destroy_workqueue(port_priv->wq); - ib_destroy_qp(port_priv->qp[1]); - ib_destroy_qp(port_priv->qp[0]); + 
destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); ib_destroy_cq(port_priv->cq); Index: access/mad_priv.h =================================================================== --- access/mad_priv.h (revision 1116) +++ access/mad_priv.h (working copy) @@ -79,16 +79,13 @@ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 - -union ib_mad_recv_wrid { - u64 wrid; - struct { - u32 index; - u32 qpn; - } wrid_field; +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; }; struct ib_mad_private_header { + struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; struct ib_mad_recv_buf recv_buf; DECLARE_PCI_UNMAP_ADDR(mapping) @@ -108,7 +105,7 @@ struct list_head agent_list; struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; - struct ib_mad_port_private *port_priv; + struct ib_mad_qp_info *qp_info; spinlock_t lock; struct list_head send_list; @@ -122,7 +119,7 @@ }; struct ib_mad_send_wr_private { - struct list_head send_list; + struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent *agent; u64 wr_id; /* client WR ID */ @@ -140,11 +137,25 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + /* struct ib_mad_queue overflow_queue; */ +}; + struct ib_mad_port_private { struct list_head port_list; struct ib_device *device; int port_num; - struct ib_qp *qp[IB_MAD_QPS_CORE]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; @@ -154,15 +165,7 @@ struct list_head agent_list; struct workqueue_struct *wq; struct work_struct work; - - spinlock_t send_list_lock; - struct list_head send_posted_mad_list; - int send_posted_mad_count; - - spinlock_t recv_list_lock; - struct list_head recv_posted_mad_list[IB_MAD_QPS_CORE]; - int recv_posted_mad_count[IB_MAD_QPS_CORE]; - u32 recv_wr_index[IB_MAD_QPS_CORE]; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; #endif /* __IB_MAD_PRIV_H__ */ From krkumar at us.ibm.com Tue Nov 2 12:09:42 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 12:09:42 -0800 (PST) Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops Message-ID: I am not entirely sure that I understand the bitwise operator being used in the code. Following patch is assuming that I have got it right :-). 
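As a concrete illustration of the idiom in question, here is a minimal user-space sketch of the same set-bit walk. It is only an approximation under stated assumptions: find_next_set() and MASK_WORDS are invented stand-ins (the kernel's find_first_bit()/find_next_bit() have different signatures), while IB_MGMT_MAX_METHODS and the packing of the mask into unsigned longs mirror method_mask; none of this is the actual mad.c code.

#include <stdio.h>
#include <limits.h>

#define IB_MGMT_MAX_METHODS 128
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS (IB_MGMT_MAX_METHODS / BITS_PER_LONG)

/* Invented stand-in for the kernel's find_next_bit(): index of the first
 * set bit at or after 'start', or 'nbits' when no further bit is set. */
static unsigned int find_next_set(const unsigned long *mask,
                                  unsigned int nbits, unsigned int start)
{
        unsigned int i;

        for (i = start; i < nbits; ++i)
                if (mask[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                        return i;
        return nbits;
}

int main(void)
{
        unsigned long method_mask[MASK_WORDS] = { 0 };
        unsigned int i;

        /* Mark two methods as registered, e.g. 0x01 (Get) and 0x02 (Set). */
        method_mask[0x01 / BITS_PER_LONG] |= 1UL << (0x01 % BITS_PER_LONG);
        method_mask[0x02 / BITS_PER_LONG] |= 1UL << (0x02 % BITS_PER_LONG);

        /* Same loop shape as in the patch below: visit only the set bits
         * instead of scanning all 128 method slots. */
        for (i = find_next_set(method_mask, IB_MGMT_MAX_METHODS, 0);
             i < IB_MGMT_MAX_METHODS;
             i = find_next_set(method_mask, IB_MGMT_MAX_METHODS, i + 1))
                printf("method 0x%02x is registered\n", i);

        return 0;
}

Compiled stand-alone this prints exactly the two marked methods, which is the effect the loops in the patch below depend on.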
thanks, - KK diff -ruNp 5/mad.c 6/mad.c --- 5/mad.c 2004-11-02 12:07:51.000000000 -0800 +++ 6/mad.c 2004-11-02 12:08:32.000000000 -0800 @@ -537,9 +537,13 @@ static int check_method_table(struct ib_ { int i; - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) - if (method->agent[i]) - return 1; + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + /* if we entered the loop, we have found an agent bit set */ + return 1; + } return 0; } @@ -561,11 +565,13 @@ static void remove_methods_mad_agent(str { int i; - /* Remove any methods for this mad agent */ - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { - if (method->agent[i] == agent) { - method->agent[i] = NULL; - } + /* Remove all methods for this mad agent */ + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + BUG_ON(method->agent[i] != agent); + method->agent[i] = NULL; } } From halr at voltaire.com Tue Nov 2 14:06:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 17:06:30 -0500 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041102120244.498d75f2.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> <1099425905.4129.29.camel@hpc-1> <20041102120244.498d75f2.mshefty@ichips.intel.com> Message-ID: <1099433189.3266.0.camel@localhost.localdomain> On Tue, 2004-11-02 at 15:02, Sean Hefty wrote: > I think my mailer wrapped the lines after I hit send. Let me try again. That's better. Thanks. -- Hal From halr at voltaire.com Tue Nov 2 14:13:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 17:13:11 -0500 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041102120244.498d75f2.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> <1099425905.4129.29.camel@hpc-1> <20041102120244.498d75f2.mshefty@ichips.intel.com> Message-ID: <1099433591.3266.4.camel@localhost.localdomain> On Tue, 2004-11-02 at 15:02, Sean Hefty wrote: > I think my mailer wrapped the lines after I hit send. Let me try again. Thanks! Applied. -- Hal From halr at voltaire.com Tue Nov 2 14:30:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 17:30:39 -0500 Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: References: Message-ID: <1099434639.3266.18.camel@localhost.localdomain> On Tue, 2004-11-02 at 15:09, Krishna Kumar wrote: > I am not entirely sure that I understand the bitwise operator being > used in the code. Following patch is assuming that I have got it > right :-). 
> > thanks, > > - KK > > diff -ruNp 5/mad.c 6/mad.c > --- 5/mad.c 2004-11-02 12:07:51.000000000 -0800 > +++ 6/mad.c 2004-11-02 12:08:32.000000000 -0800 > @@ -537,9 +537,13 @@ static int check_method_table(struct ib_ > { > int i; > > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) > - if (method->agent[i]) > - return 1; > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > + i < IB_MGMT_MAX_METHODS; > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > + 1+i)) { > + /* if we entered the loop, we have found an agent bit set */ > + return 1; > + } > return 0; > } This is no longer checking the method table. It is checking the registration request. Also, a pointer to the registration request would need to be passed into this routine if it is to be used. > @@ -561,11 +565,13 @@ static void remove_methods_mad_agent(str > { > int i; > > - /* Remove any methods for this mad agent */ > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { > - if (method->agent[i] == agent) { > - method->agent[i] = NULL; > - } > + /* Remove all methods for this mad agent */ > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > + i < IB_MGMT_MAX_METHODS; > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > + 1+i)) { > + BUG_ON(method->agent[i] != agent); > + method->agent[i] = NULL; > } > } Same compilation issue as above: A pointer to the registration request would need to be passed into this routine if it is to be used. -- Hal From krkumar at us.ibm.com Tue Nov 2 15:46:26 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 15:46:26 -0800 (PST) Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: <1099434639.3266.18.camel@localhost.localdomain> Message-ID: Hi Hal, I didn't fix the argument to this routine, I was trying to understand if the idea behind this will work, hence the RFC in the subject. Sorry for creating the confusion. I was trying to understand if what I was assuming is right or not. The first part of the patch checks if any bit is set in the method_mask, and if so, it means that a method (table?) is registered and hence it returns error. Actually there might be a better way to check if the bitmask is all-zero's and avoid the for loop, but I don't see any macros for that, and I didn't want to use "if (method_mask)". The add_mad_reg_req() code is doing : /* Finally, add in methods being registered */ for (i = find_first_bit(mad_reg_req->method_mask,IB_MGMT_MAX_METHODS); i < IB_MGMT_MAX_METHODS; i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { (*method)->agent[i] = priv; } So the agent[0-128] is pointing to the priv when that particular bitmask is set in the method_mask (exact same bit number is used as index in agent). Do you think this model is correct ? The Get/Set/Repress functions can be checked faster by checking if the first/next bit being set rather than going through the entire array of 128 agents, or in one case whether the bitmask is zero or non-zero instead of looping 128 times. If this is true, I will recreate the patch and compile before sending in the final patch. thx, - KK On Tue, 2 Nov 2004, Hal Rosenstock wrote: > On Tue, 2004-11-02 at 15:09, Krishna Kumar wrote: > > I am not entirely sure that I understand the bitwise operator being > > used in the code. Following patch is assuming that I have got it > > right :-). 
> > > > thanks, > > > > - KK > > > > diff -ruNp 5/mad.c 6/mad.c > > --- 5/mad.c 2004-11-02 12:07:51.000000000 -0800 > > +++ 6/mad.c 2004-11-02 12:08:32.000000000 -0800 > > @@ -537,9 +537,13 @@ static int check_method_table(struct ib_ > > { > > int i; > > > > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) > > - if (method->agent[i]) > > - return 1; > > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > > + i < IB_MGMT_MAX_METHODS; > > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > > + 1+i)) { > > + /* if we entered the loop, we have found an agent bit set */ > > + return 1; > > + } > > return 0; > > } > > This is no longer checking the method table. It is checking the > registration request. Also, a pointer to the registration request > would need to be passed into this routine if it is to be used. > > > @@ -561,11 +565,13 @@ static void remove_methods_mad_agent(str > > { > > int i; > > > > - /* Remove any methods for this mad agent */ > > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { > > - if (method->agent[i] == agent) { > > - method->agent[i] = NULL; > > - } > > + /* Remove all methods for this mad agent */ > > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > > + i < IB_MGMT_MAX_METHODS; > > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > > + 1+i)) { > > + BUG_ON(method->agent[i] != agent); > > + method->agent[i] = NULL; > > } > > } > > Same compilation issue as above: > A pointer to the registration request would need to be passed into this > routine if it is to be used. > > -- Hal > > > From iod00d at hp.com Tue Nov 2 16:10:15 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 2 Nov 2004 16:10:15 -0800 Subject: [openib-general] ib_modify_qp() too many arguments Message-ID: <20041103001015.GA13563@cup.hp.com> Roland, I am trying to build roland-merge #1119 on top of 2.6.10-rc1 for ia64. And yes, the usage noted below doesn't match the declaration: CC [M] drivers/infiniband/core/cm_main.o In file included from drivers/infiniband/core/cm_main.c:24: drivers/infiniband/core/cm_priv.h: In function `ib_cm_qp_modify': drivers/infiniband/core/cm_priv.h:183: error: too many arguments to function `ib_modify_qp' Should I not (yet) be enabling CONFIG_INFINIBAND_CM? Trivial patch appended to "fix". Though I don't know if it's "right" since it seems ib_cm_qp_modify() could just go away. thanks, grant Index: src/linux-kernel/infiniband/core/cm_priv.h =================================================================== --- src/linux-kernel/infiniband/core/cm_priv.h (revision 1116) +++ src/linux-kernel/infiniband/core/cm_priv.h (working copy) @@ -178,9 +178,7 @@ struct ib_qp_attr *attr, int attr_mask) { - struct ib_qp_cap qp_cap; - - return qp ? ib_modify_qp(qp, attr, attr_mask, &qp_cap) : 0; + return qp ? ib_modify_qp(qp, attr, attr_mask) : 0; } int ib_cm_timeout_to_jiffies(int timeout); From halr at voltaire.com Tue Nov 2 18:25:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 21:25:34 -0500 Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: References: Message-ID: <1099448734.3266.239.camel@localhost.localdomain> On Tue, 2004-11-02 at 18:46, Krishna Kumar wrote: > I didn't fix the argument to this routine, I was trying to understand if > the idea behind this will work, hence the RFC in the subject. Sorry for > creating the confusion. I was trying to understand if what I was assuming > is right or not. 
> > The first part of the patch checks if any bit is set in the method_mask, > and if so, it means that a method (table?) is registered and hence it > returns error. Actually there might be a better way to check if the > bitmask is all-zero's and avoid the for loop, but I don't see any macros > for that, and I didn't want to use "if (method_mask)". > > The add_mad_reg_req() code is doing : > /* Finally, add in methods being registered */ > for (i = find_first_bit(mad_reg_req->method_mask,IB_MGMT_MAX_METHODS); > i < IB_MGMT_MAX_METHODS; > i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > 1+i)) { > (*method)->agent[i] = priv; > } > So the agent[0-128] is pointing to the priv when that particular bitmask > is set in the method_mask (exact same bit number is used as index in > agent). > > Do you think this model is correct ? The Get/Set/Repress functions can be > checked faster by checking if the first/next bit being set rather than > going through the entire array of 128 agents, or in one case whether the > bitmask is zero or non-zero instead of looping 128 times. If this is true, > I will recreate the patch and compile before sending in the final patch. Sorry, I misunderstood that it was an RFC. In fact what I wrote was wrong. Checking the registration request is just as good as checking the method table. Perhaps the routine is now check_method_mask rather than check_method_table. The calling parameters are straightforward to fix. So the model seems fine to me. -- Hal From halr at voltaire.com Tue Nov 2 18:29:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 21:29:08 -0500 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <20041103001015.GA13563@cup.hp.com> References: <20041103001015.GA13563@cup.hp.com> Message-ID: <1099448948.3266.248.camel@localhost.localdomain> On Tue, 2004-11-02 at 19:10, Grant Grundler wrote: > I am trying to build roland-merge #1119 on top of 2.6.10-rc1 for ia64. > And yes, the usage noted below doesn't match the declaration: > > CC [M] drivers/infiniband/core/cm_main.o > In file included from drivers/infiniband/core/cm_main.c:24: > drivers/infiniband/core/cm_priv.h: In function `ib_cm_qp_modify': > drivers/infiniband/core/cm_priv.h:183: error: too many arguments to > function `ib_modify_qp' > > Should I not (yet) be enabling CONFIG_INFINIBAND_CM? > Trivial patch appended to "fix". There was recently a verbs change which caused this, and it looks like we missed this as CM is not "used" yet. > Though I don't know if it's "right" Looks right to me. > since it seems ib_cm_qp_modify() could just go away. QP modify is needed by CM (as it walks a connection through its states it modifies the QP state) and UD QPs as well. It can't go away. > Index: src/linux-kernel/infiniband/core/cm_priv.h > =================================================================== > --- src/linux-kernel/infiniband/core/cm_priv.h (revision 1116) > +++ src/linux-kernel/infiniband/core/cm_priv.h (working copy) > @@ -178,9 +178,7 @@ > struct ib_qp_attr *attr, > int attr_mask) > { > - struct ib_qp_cap qp_cap; > - > - return qp ? ib_modify_qp(qp, attr, attr_mask, &qp_cap) : 0; > + return qp ? 
ib_modify_qp(qp, attr, attr_mask) : 0; > } > > int ib_cm_timeout_to_jiffies(int timeout); -- Hal From krkumar at us.ibm.com Tue Nov 2 18:33:10 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 18:33:10 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041102102126.26746a63.mshefty@ichips.intel.com> Message-ID: Hi Sean, I guess you meant "even if solicited is NOT set". What you described is right, the race will mean that the remove_mad_reg_req() will free things like method/class, while the find_mad_agent looks through the version and class to find the mad_agent. This patch will fix it correctly. I have also cleaned up a hack in ib_mad_recv_done_handler() where a test for '!mad_agent' was being done to determine whether to free 'recv' or not :-). Couple of issues with the new code (same as old code, though) : 1. printk(KERN_ERR PFX "No client 0x%x for received MAD " "on port %d\n", hi_tid, port_priv->port_num); and printk(KERN_NOTICE PFX "No matching mad agent found for " "received MAD on port %d\n", port_priv->port_num); both get printed when mad_agent is not found in solicited case. 2. spin_unlock is performed after all the printk's, which is a bit icky. Compile-tested patch (not tested) follows at the end of the mail. Let me know if I should fix above problems too. Thanks, - KK On Tue, 2 Nov 2004, Sean Hefty wrote: > On Tue, 2 Nov 2004 09:59:14 -0800 (PST) > Krishna Kumar wrote: > > > Hi Sean, > > > > I think that is the best approach. And using this method, we can also > > avoid holding the lock if solicited is set. I will send a patch in a > > few minutes if this approach looks good. > > Sounds good. > > I think that you'll need to hold the lock even if solicited is set to > handle the case where a response is received after the sender > unregistered. 
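The rule under discussion here — take the reference inside the same lock that the unregister path holds, so the agent cannot be freed between lookup and use — can be sketched in a few lines of stand-alone C. This is only an analogue under stated assumptions: pthread_mutex stands in for the kernel spinlock, a plain int for atomic_t, and agent_list, lookup_agent_get() and agent_put() are invented names rather than mad.c symbols.

#include <pthread.h>
#include <stdlib.h>

struct agent {
        struct agent *next;
        unsigned int hi_tid;
        int refcount;                     /* stand-in for atomic_t */
};

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static struct agent *agent_list;

/* Look an agent up by the high 32 bits of the TID and take a reference
 * before reg_lock is dropped; a concurrent unregister can then no longer
 * free the agent in the window between lookup and use. */
static struct agent *lookup_agent_get(unsigned int hi_tid)
{
        struct agent *a;

        pthread_mutex_lock(&reg_lock);
        for (a = agent_list; a; a = a->next)
                if (a->hi_tid == hi_tid) {
                        a->refcount++;
                        break;
                }
        pthread_mutex_unlock(&reg_lock);
        return a;                         /* NULL when not found */
}

/* Drop a reference; whoever drops the last one frees the agent. */
static void agent_put(struct agent *a)
{
        int last;

        pthread_mutex_lock(&reg_lock);
        last = (--a->refcount == 0);
        pthread_mutex_unlock(&reg_lock);
        if (last)
                free(a);
}

int main(void)
{
        struct agent *a = calloc(1, sizeof *a);
        struct agent *found;

        if (!a)
                return 1;
        a->hi_tid = 42;
        a->refcount = 1;                  /* registration's reference */
        agent_list = a;

        found = lookup_agent_get(42);
        if (found)
                agent_put(found);         /* done with the lookup */

        agent_list = NULL;                /* "unregister" */
        agent_put(a);                     /* drop registration's reference */
        return 0;
}

Because the increment happens before the lock is released, an unregister running concurrently and dropping its own reference leaves the lookup's reference intact; that is the window the patch below closes by moving the atomic_inc() under reg_lock inside find_mad_agent().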
> > - Sean diff -ruNp 7/mad.c 8/mad.c --- 7/mad.c 2004-11-02 16:13:05.000000000 -0800 +++ 8/mad.c 2004-11-02 18:30:19.000000000 -0800 @@ -747,13 +747,16 @@ find_mad_agent(struct ib_mad_port_privat struct ib_mad *mad, int solicited) { - struct ib_mad_agent_private *entry, *mad_agent = NULL; - struct ib_mad_mgmt_class_table *version; - struct ib_mad_mgmt_method_table *class; - u32 hi_tid; + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); /* Whether MAD was solicited determines type of routing to MAD client */ if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + /* Routing is based on high 32 bits of transaction ID of MAD */ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; list_for_each_entry(entry, &port_priv->agent_list, agent_list) { @@ -762,12 +765,14 @@ find_mad_agent(struct ib_mad_port_privat break; } } - if (!mad_agent) { + if (!mad_agent) printk(KERN_ERR PFX "No client 0x%x for received MAD " - "on port %d\n", hi_tid, port_priv->port_num); - goto out; - } + "on port %d\n", + hi_tid, port_priv->port_num); } else { + struct ib_mad_mgmt_class_table *version; + struct ib_mad_mgmt_method_table *class; + /* Routing is based on version, class, and method */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { printk(KERN_ERR PFX "MAD received with unsupported " @@ -784,23 +789,30 @@ find_mad_agent(struct ib_mad_port_privat } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; - if (!class) { + if (class) + mad_agent = class->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + else printk(KERN_ERR PFX "MAD received on port %d for class " "%d with no client\n", port_priv->port_num, mad->mad_hdr.mgmt_class); - goto out; - } - mad_agent = class->agent[mad->mad_hdr.method & - ~IB_MGMT_METHOD_RESP]; } out: - if (mad_agent && !mad_agent->agent.recv_handler) { - printk(KERN_ERR PFX "No receive handler for client " - "%p on port %d\n", - &mad_agent->agent, port_priv->port_num); - mad_agent = NULL; - } + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + mad_agent = NULL; + printk(KERN_ERR PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + } + } else + printk(KERN_NOTICE PFX "No matching mad agent found for " + "received MAD on port %d\n", port_priv->port_num); + + spin_unlock_irqrestore(&port_priv->reg_lock, flags); return mad_agent; } @@ -929,9 +941,8 @@ static void ib_mad_recv_done_handler(str struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; - struct ib_mad_agent_private *mad_agent = NULL; + struct ib_mad_agent_private *mad_agent; int solicited; - unsigned long flags; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; @@ -965,23 +976,17 @@ static void ib_mad_recv_done_handler(str recv->header.recv_buf.mad)) goto out; - spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); - if (!mad_agent) { - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); - } else { - atomic_inc(&mad_agent->refcount); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); + if (mad_agent) { 
ib_mad_complete_recv(mad_agent, recv, solicited); + recv = NULL; /* recv is freed up via ib_mad_complete_recv */ } out: - if (!mad_agent) { + if (recv) { /* Should this case be optimized ? */ kmem_cache_free(ib_mad_cache, recv); } From krkumar at us.ibm.com Tue Nov 2 18:35:16 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 18:35:16 -0800 (PST) Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: <1099448734.3266.239.camel@localhost.localdomain> Message-ID: Hi Hal, Thanks for the update. I will recreate the patch and send it tomorrow. thanks, - KK On Tue, 2 Nov 2004, Hal Rosenstock wrote: > On Tue, 2004-11-02 at 18:46, Krishna Kumar wrote: > > I didn't fix the argument to this routine, I was trying to understand if > > the idea behind this will work, hence the RFC in the subject. Sorry for > > creating the confusion. I was trying to understand if what I was assuming > > is right or not. > > > > The first part of the patch checks if any bit is set in the method_mask, > > and if so, it means that a method (table?) is registered and hence it > > returns error. Actually there might be a better way to check if the > > bitmask is all-zero's and avoid the for loop, but I don't see any macros > > for that, and I didn't want to use "if (method_mask)". > > > > The add_mad_reg_req() code is doing : > > /* Finally, add in methods being registered */ > > for (i = find_first_bit(mad_reg_req->method_mask,IB_MGMT_MAX_METHODS); > > i < IB_MGMT_MAX_METHODS; > > i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > > 1+i)) { > > (*method)->agent[i] = priv; > > } > > So the agent[0-128] is pointing to the priv when that particular bitmask > > is set in the method_mask (exact same bit number is used as index in > > agent). > > > > Do you think this model is correct ? The Get/Set/Repress functions can be > > checked faster by checking if the first/next bit being set rather than > > going through the entire array of 128 agents, or in one case whether the > > bitmask is zero or non-zero instead of looping 128 times. If this is true, > > I will recreate the patch and compile before sending in the final patch. > > Sorry I misunderstood that it was RFC. In fact what I wrote was wrong. > Checking the registration request is just as good as checking the method > table. Perhaps the routine is now check_method_mask rather than > check_method_table. The calling parameters are straightforward to fix. > So the model seems fine to me. > > -- Hal > > > From mashirle at us.ibm.com Tue Nov 2 20:00:09 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Tue, 2 Nov 2004 20:00:09 -0800 Subject: [openib-general] [PATCH] fix memory leak problem in agent_mad_send() Message-ID: <200411022000.09120.mashirle@us.ibm.com> Here is the patch. Please review it. 
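The leaks plugged below all have the same shape: agent_mad_send() acquires resources one after another, and an early return forgets what is already held. The patch fixes this by freeing explicitly at each exit; the common alternative, sketched stand-alone below with invented names (make_packet() and friends, not the actual agent.c code), is a single goto unwind ladder so each failure releases exactly what was acquired before it.

#include <stdlib.h>
#include <string.h>

struct packet {
        char *buf;                /* first acquisition */
        char *wr;                 /* second acquisition */
};

/* Allocate two resources; on any failure unwind, in reverse order, exactly
 * what has already been acquired, so no early exit can leak. */
static struct packet *make_packet(size_t len)
{
        struct packet *p;

        p = malloc(sizeof *p);
        if (!p)
                goto err;

        p->buf = malloc(len);
        if (!p->buf)
                goto err_free_p;

        p->wr = malloc(len);
        if (!p->wr)
                goto err_free_buf;

        memset(p->buf, 0, len);
        memset(p->wr, 0, len);
        return p;

err_free_buf:
        free(p->buf);
err_free_p:
        free(p);
err:
        return NULL;
}

int main(void)
{
        struct packet *p = make_packet(256);

        if (p) {
                free(p->wr);
                free(p->buf);
                free(p);
        }
        return 0;
}

Each new acquisition only adds one label to the ladder, and every earlier resource is released no matter where the failure happens, which is why kernel code generally favors this style over per-exit kfree() calls as setup paths grow.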
diff -urN access/agent.c access.patch5/agent.c --- access/agent.c 2004-11-02 17:40:06.000000000 -0800 +++ access.patch5/agent.c 2004-11-02 18:43:47.534608536 -0800 @@ -357,12 +357,16 @@ if (!port_priv) { printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", mad_agent); + kfree(mad); return; } agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); - if (!agent_send_wr) + if (!agent_send_wr) { + printk(KERN_ERR SPFX "No memory for agent work request\n"); + kfree(mad); return; + } agent_send_wr->mad = mad; /* PCI mapping */ @@ -407,6 +411,7 @@ if (IS_ERR(agent_send_wr->ah)) { printk(KERN_ERR SPFX "No memory for address handle\n"); kfree(mad); + kfree(agent_send_wr); return; } @@ -432,6 +437,8 @@ sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); + kfree(mad); + kfree(agent_send_wr); } else { list_add_tail(&agent_send_wr->send_list, &port_priv->send_posted_list); -- Thanks Shirley Ma IBM Linux Technology Center From roland at topspin.com Tue Nov 2 20:06:47 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 20:06:47 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <20041103001015.GA13563@cup.hp.com> (Grant Grundler's message of "Tue, 2 Nov 2004 16:10:15 -0800") References: <20041103001015.GA13563@cup.hp.com> Message-ID: <523bzrshig.fsf@topspin.com> Grant> Roland, I am trying to build roland-merge #1119 on top of Grant> 2.6.10-rc1 for ia64. And yes, the usage noted below Grant> doesn't match the declaration: Grant> Should I not (yet) be enabling CONFIG_INFINIBAND_CM? Grant> Trivial patch appended to "fix". Though I don't know if Grant> it's "right" since it seems ib_cm_qp_modify() could just go Grant> away. CONFIG_INFINIBAND_CM depends on CONFIG_BROKEN now, so I wouldn't expect it to build (it needs to be converted to the new MAD API). So for now don't try to build it. Your patch is a small step in the right direction so I applied it. The reason I have the ib_cm_qp_modify() function is to add the check for a NULL qp, which makes the CM logic simpler. - R. From roland at topspin.com Tue Nov 2 22:27:48 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 22:27:48 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access Message-ID: <52y8hjqwez.fsf@topspin.com> I've just checked in an initial version of userspace MAD access (including documentation in docs/user_mad.txt). Unfortunately this is not quite ready for use underneath OpenSM, since it is not possible to register an agent for the SM classes (since they are currently grabbed by the kernel SMA first). All criticisms and comments greatly appreciated... Thanks, Roland Index: infiniband/include/ib_user_mad.h =================================================================== --- infiniband/include/ib_user_mad.h (revision 0) +++ infiniband/include/ib_user_mad.h (revision 0) @@ -0,0 +1,97 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; +}; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 0, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 1, __u32) + +#endif /* IB_USER_MAD_H */ Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 1086) +++ infiniband/core/Makefile (working copy) @@ -10,7 +10,8 @@ obj-$(CONFIG_INFINIBAND) += \ ib_core.o \ ib_mad.o \ - ib_sa.o + ib_sa.o \ + ib_umad.o obj-$(CONFIG_INFINIBAND_CM) += \ ib_cm.o @@ -36,6 +37,8 @@ ib_sa-objs := sa_query.o +ib_umad-objs := user_mad.o + ib_cm-objs := \ cm_main.o \ cm_api.o \ Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 0) +++ infiniband/core/user_mad.c (revision 0) @@ -0,0 +1,639 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device *class_dev; + struct ib_device *ib_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + struct semaphore mutex; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock = SPIN_LOCK_UNLOCKED; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static struct class_simple *umad_class; + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_umad_packet *packet = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + PCI_DMA_TODEVICE); + ib_destroy_ah(packet->ah); + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + down(&file->mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + list_add_tail(&packet->list, 
&file->recv_list); + wake_up_interruptible(&file->recv_wait); + goto agent; + } + + kfree(packet); + +agent: + up(&file->mutex); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + if (down_interruptible(&file->mutex)) + return -ERESTARTSYS; + + while (list_empty(&file->recv_list)) { + up(&file->mutex); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + if (down_interruptible(&file->mutex)) + return -ERESTARTSYS; + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + up(&file->mutex); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + if (down_interruptible(&file->mutex)) { + ret = -ERESTARTSYS; + goto err; + } + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = pci_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + PCI_DMA_TODEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + PCI_DMA_TODEVICE); + goto err_up; + } + + up(&file->mutex); + + return sizeof packet->mad; + +err_up: + up(&file->mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = 
filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + if (down_interruptible(&file->mutex)) + return -EINTR; + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up(&file->mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + if (down_interruptible(&file->mutex)) + return -EINTR; + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up(&file->mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + init_MUTEX(&file->mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = 
ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + struct ib_device_attr attr; + if (ib_query_device(device, &attr)) + return; + + s = 1; + e = attr.phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev = + class_simple_device_add(umad_class, + umad_dev->port[i - s].dev.dev, + &device->dma_device->dev, + "umad%d", umad_dev->port[i - s].devnum); + if (IS_ERR(umad_dev->port[i - s].class_dev)) + goto err_class; + + class_set_devdata(umad_dev->port[i - s].class_dev, + &umad_dev->port[i - s]); + + class_device_create_file(umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev); + class_device_create_file(umad_dev->port[i - s].class_dev, + &class_device_attr_port); + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) { + class_simple_device_remove(umad_dev->port[i - s].dev.dev); + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + } + + kfree(umad_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { + class_simple_device_remove(umad_dev->port[i].dev.dev); + cdev_del(&umad_dev->port[i].dev); + clear_bit(umad_dev->port[i].devnum, dev_map); + } + + kfree(umad_dev); +} + +static int ib_umad_hotplug(struct class_device *dev, char **envp, + int num_envp, char *buffer, int buffer_size) +{ + return 0; +} + +static int __init ib_umad_init(void) +{ + int ret; + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + umad_class = class_simple_create(THIS_MODULE, "infiniband_mad"); + if 
(IS_ERR(umad_class)) { + printk(KERN_ERR "user_mad: couldn't create class_simple\n"); + ret = PTR_ERR(umad_class); + goto out_chrdev; + } + + ret = class_simple_set_hotplug(umad_class, ib_umad_hotplug); + if (ret) { + printk(KERN_ERR "user_mad: couldn't set class_simple hotplug\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + return 0; + +out_class: + class_simple_destroy(umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + ib_unregister_client(&umad_client); + class_simple_destroy(umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); Index: docs/user_mad.txt =================================================================== --- docs/user_mad.txt (revision 0) +++ docs/user_mad.txt (revision 0) @@ -0,0 +1,70 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" + + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 + for port 1 of device mthca0. 
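Taken together, the register/read/write snippets in docs/user_mad.txt add up to a small complete client. The following is a sketch only, not part of the checkin: it assumes the struct ib_user_mad, struct ib_user_mad_reg_req and ioctl definitions from the new header are available as "ib_user_mad.h" (the include path is hypothetical), uses the /dev/infiniband/mthca0/umad1 node from the udev example, and sends to an arbitrary example LID using the PerfMgmt class (0x04).

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <arpa/inet.h>
	#include "ib_user_mad.h"	/* hypothetical include path */

	int main(void)
	{
		struct ib_user_mad_reg_req req;
		struct ib_user_mad mad;
		int fd = open("/dev/infiniband/mthca0/umad1", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		memset(&req, 0, sizeof req);
		req.qpn                = 1;	/* GSI; 0 would select the SMI QP */
		req.mgmt_class         = 0x04;	/* e.g. performance management */
		req.mgmt_class_version = 1;
		/* method_mask left zero: we only expect replies to our own sends */

		if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req)) {
			perror("agent register");
			return 1;
		}

		memset(&mad, 0, sizeof mad);
		mad.id   = req.id;		/* agent id filled in by the ioctl */
		mad.lid  = htons(4);		/* example destination LID */
		mad.qpn  = htonl(1);		/* remote GSI QP */
		mad.qkey = htonl(0x80010000);	/* well-known GSI QKey */
		/* ... fill in mad.data with the request MAD ... */

		if (write(fd, &mad, sizeof mad) != sizeof mad)
			perror("write");
		else if (read(fd, &mad, sizeof mad) != sizeof mad)
			perror("read");
		else
			printf("reply from LID 0x%x\n", ntohs(mad.lid));

		close(fd);
		return 0;
	}

Note that read() blocks until a MAD arrives unless the descriptor is opened with O_NONBLOCK; poll() can be used to wait instead, and since ib_umad_poll above always reports the device writable, only POLLIN is worth waiting for.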
From halr at voltaire.com Wed Nov 3 06:01:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:01:49 -0500 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: Message-ID: <1099490509.2831.19.camel@hpc-1> On Tue, 2004-11-02 at 21:33, Krishna Kumar wrote: > Hi Sean, > > I guess you meant "even if solicited is NOT set". What you described is > right, the race will mean that the remove_mad_reg_req() will free things > like method/class, while the find_mad_agent looks through the version > and class to find the mad_agent. This patch will fix it correctly. > > I have also cleaned up a hack in ib_mad_recv_done_handler() where a > test for '!mad_agent' was being done to determine whether to free 'recv' > or not :-). Thanks. Applied. -- Hal From halr at voltaire.com Wed Nov 3 06:06:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:06:28 -0500 Subject: [openib-general] [PATCH] fix memory leak problem in agent_mad_send() In-Reply-To: <200411022000.09120.mashirle@us.ibm.com> References: <200411022000.09120.mashirle@us.ibm.com> Message-ID: <1099490788.2831.26.camel@hpc-1> On Tue, 2004-11-02 at 23:00, Shirley Ma wrote: > Here is the patch. Please review it. Yes, these are memory leaks in agent_mad_send, but it would be better to fix them with the deallocation done at the same function level where the allocation is done. Hence, rather than agent_mad_send returning void, it should return int, etc. I will post a patch for this shortly. -- Hal > > diff -urN access/agent.c access.patch5/agent.c > --- access/agent.c 2004-11-02 17:40:06.000000000 -0800 > +++ access.patch5/agent.c 2004-11-02 18:43:47.534608536 -0800 > @@ -357,12 +357,16 @@ > if (!port_priv) { > printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", > mad_agent); > + kfree(mad); > return; > } > > agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); > - if (!agent_send_wr) > + if (!agent_send_wr) { > + printk(KERN_ERR SPFX "No memory for agent work request\n"); > + kfree(mad); > return; > + } > agent_send_wr->mad = mad; > > /* PCI mapping */ > @@ -407,6 +411,7 @@ > if (IS_ERR(agent_send_wr->ah)) { > printk(KERN_ERR SPFX "No memory for address handle\n"); > kfree(mad); > + kfree(agent_send_wr); > return; > } > > @@ -432,6 +437,8 @@ > sizeof(struct ib_mad), > PCI_DMA_TODEVICE); > ib_destroy_ah(agent_send_wr->ah); > + kfree(mad); > + kfree(agent_send_wr); > } else { > list_add_tail(&agent_send_wr->send_list, > &port_priv->send_posted_list); > > From halr at voltaire.com Wed Nov 3 06:22:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:22:11 -0500 Subject: [openib-general] mthca dma_pool_destroy mthca_av busy message Message-ID: <1099491731.2831.45.camel@hpc-1> Hi Roland, When shutting down mthca after shutting down IPoIB, the following message appears on the console: ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, c03a6000 busy Is everything OK the next time mthca, etc. is started? (It appears to be.) Thanks.
-- Hal From halr at voltaire.com Wed Nov 3 06:25:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:25:20 -0500 Subject: [openib-general] [PATCH] agent: Fix memory leaks associated with agent_mad_send errors Message-ID: <1099491920.2831.50.camel@hpc-1> agent: Fix memory leaks (identified by Shirley Ma) associated with agent_mad_send errors Index: agent.c =================================================================== --- agent.c (revision 1104) +++ agent.c (working copy) @@ -339,10 +339,10 @@ return entry; } -void agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) +int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_mad *mad, + struct ib_grh *grh, + struct ib_mad_recv_wc *mad_recv_wc) { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; @@ -351,18 +351,19 @@ struct ib_send_wr *bad_send_wr; struct ib_ah_attr ah_attr; unsigned long flags; + int ret = 1; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", mad_agent); - return; + goto out; } agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); if (!agent_send_wr) - return; + goto out; agent_send_wr->mad = mad; /* PCI mapping */ @@ -406,8 +407,8 @@ agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); if (IS_ERR(agent_send_wr->ah)) { printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(mad); - return; + kfree(agent_send_wr); + goto out; } send_wr.wr.ud.ah = agent_send_wr->ah; @@ -432,11 +433,16 @@ sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); } else { list_add_tail(&agent_send_wr->send_list, &port_priv->send_posted_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; } + +out: + return ret; } int smi_send_smp(struct ib_mad_agent *mad_agent, @@ -470,8 +476,9 @@ kfree(smp_response); return 0; } - agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc); + if (agent_mad_send(mad_agent, smp_response, + NULL, mad_recv_wc)) + kfree(smp_response); } else kfree(smp_response); return 1; From roland at topspin.com Wed Nov 3 07:06:20 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 07:06:20 -0800 Subject: [openib-general] Re: mthca dma_pool_destroy mthca_av busy message In-Reply-To: <1099491731.2831.45.camel@hpc-1> (Hal Rosenstock's message of "Wed, 03 Nov 2004 09:22:11 -0500") References: <1099491731.2831.45.camel@hpc-1> Message-ID: <52u0s7q8er.fsf@topspin.com> Hal> Hi Roland, When shutting down mthca after shutting down Hal> IPoIB, the following message appears on the console: Hal> ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, c03a6000 busy Yes, this is because IPoIB currently leaks AVs (we need to hook into the neighbour destructor to know when we can destroy an AV). Hal> Is everything OK the next time mthca, etc. is started ? (It Hal> appears to be). I think so. - Roland From sean.hefty at intel.com Wed Nov 3 08:34:29 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 08:34:29 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec inib_post_send_mad In-Reply-To: Message-ID: >Couple of issues with the new code (same as old code, though) : > >1. 
printk(KERN_ERR PFX "No client 0x%x for received MAD " > "on port %d\n", > hi_tid, port_priv->port_num); > and printk(KERN_NOTICE PFX "No matching mad agent found for " > "received MAD on port %d\n", port_priv->port_num); > both get printed when mad_agent is not found in solicited case. > >2. spin_unlock is performed after all the printk's, which is a bit icky. > >Compile-tested patch (not tested) follows at the end of the mail. Let me >know if I should fix above problems too. Thanks for the patch. If you can do something with the printk's, that would be good. They should be KERN_NOTICE, but we may want to consider just removing them. - Sean From sean.hefty at intel.com Wed Nov 3 09:00:23 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 09:00:23 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52y8hjqwez.fsf@topspin.com> Message-ID: >I've just checked in an initial version of userspace MAD access >(including documentation in docs/user_mad.txt). > >Unfortunately this is not quite ready for use underneath OpenSM, since >it is not possible to register an agent for the SM classes (since they >are currently grabbed by the kernel SMA first). > >All criticisms and comments greatly appreciated... After a first review, the code looks really good. Is anyone willing to work on porting opensm to this? If not, I can start on this. Otherwise, I will continue working on adding MAD error/overrun handling. - Sean From roland at topspin.com Wed Nov 3 09:14:26 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 09:14:26 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: (Sean Hefty's message of "Wed, 3 Nov 2004 09:00:23 -0800") References: Message-ID: <52pt2urh1p.fsf@topspin.com> Sean> Is anyone willing to work on porting opensm to this? If Sean> not, I can start on this. Otherwise, I will continue Sean> working on adding MAD error/overrun handling. It would be great to work on that but we need to resolve how to handle the SM classes first. One option would be to extend the user_mad code to handle MADs timeouts and have OpenSM only receive solicited MADs (and have OpenSM register an agent with class == 0). This is only a temporary solution because ultimately OpenSM needs to receive SMInfo SMPs. I think this still requires some figuring for how to handle the DR SMI. Another option is to revise the kernel MAD code so that it does not need to register an agent for the SM classes (ie pass all MADs to low-level driver first). - R. From sean.hefty at intel.com Wed Nov 3 09:24:09 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 09:24:09 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52pt2urh1p.fsf@topspin.com> Message-ID: >Another option is to revise the kernel MAD code so that it does not >need to register an agent for the SM classes (ie pass all MADs to >low-level driver first). I thought that we had decided to go this route, and replace snoop_mad with calls to process_mad. If we're in agreement on this, I can do it first. 
- Sean From roland at topspin.com Wed Nov 3 09:31:24 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 09:31:24 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: (Sean Hefty's message of "Wed, 3 Nov 2004 09:24:09 -0800") References: Message-ID: <52lldirg9f.fsf@topspin.com> Sean> I thought that we had decided to go this route, and replace Sean> snoop_mad with calls to process_mad. If we're in agreement Sean> on this, I can do it first. That was my impression too, so I think that would be a good route to go. - R. From halr at voltaire.com Wed Nov 3 10:27:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 13:27:20 -0500 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: References: Message-ID: <1099506440.2831.60.camel@hpc-1> On Wed, 2004-11-03 at 12:00, Sean Hefty wrote: > Is anyone willing to work on porting opensm to this? If not, > I can start on this. Otherwise, I will continue working on > adding MAD error/overrun handling. Shahar from Voltaire will be doing this. I am working now on modifying the MAD layer and agents to make the changes that have been discussed on the list relative to supporting the SM. -- Hal From sean.hefty at intel.com Wed Nov 3 10:24:57 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 10:24:57 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <1099506440.2831.60.camel@hpc-1> Message-ID: >On Wed, 2004-11-03 at 12:00, Sean Hefty wrote: >> Is anyone willing to work on porting opensm to this? If not, >> I can start on this. Otherwise, I will continue working on >> adding MAD error/overrun handling. > >Shahar from Voltaire will be doing this. I am working now on modifying >the MAD layer and agents to make the changes that have been discussed on >the list relative to supporting the SM. Then I shall return to my work of handling QP errors/overruns in the MAD layer. - Sean From halr at voltaire.com Wed Nov 3 10:32:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 13:32:49 -0500 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52pt2urh1p.fsf@topspin.com> References: <52pt2urh1p.fsf@topspin.com> Message-ID: <1099506769.2831.66.camel@hpc-1> On Wed, 2004-11-03 at 12:14, Roland Dreier wrote: > Another option is to revise the kernel MAD code so that it does not > need to register an agent for the SM classes (ie pass all MADs to > low-level driver first). That's what I'm working on now (eliminate snoop_mad and replace with process_mad). I should have the first cut by COB today. After that, I will work on SMI restructure that needs to be done so that outgoing SMI updating becomes part of ib_post_send_mad rather than a precursor to it. -- Hal From iod00d at hp.com Wed Nov 3 10:50:17 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Nov 2004 10:50:17 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <523bzrshig.fsf@topspin.com> References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> Message-ID: <20041103185017.GE17281@cup.hp.com> On Tue, Nov 02, 2004 at 08:06:47PM -0800, Roland Dreier wrote: > CONFIG_INFINIBAND_CM depends on CONFIG_BROKEN now, Sorry - I didn't see that. I only looked at the Makefile, not Kconfig. > Your patch is a small step in the right direction so I applied it. 
> The reason I have the ib_cm_qp_modify() function is to add the check > for a NULL qp, which makes the CM logic simpler. *nod*. I'll poke at other trivial things in the meantime... thanks, grant From krkumar at us.ibm.com Wed Nov 3 10:45:20 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 10:45:20 -0800 (PST) Subject: [openib-general] [PATCH] Cleanup spaces to tabs Message-ID: Entire openib cleaned up to remove 8 spaces to replace with tabs, just two files though :-) thx, - KK diff -ruN 1/agent.c 2/agent.c --- 1/agent.c 2004-11-03 10:42:29.000000000 -0800 +++ 2/agent.c 2004-11-03 10:43:24.000000000 -0800 @@ -207,7 +207,7 @@ if (hop_ptr == 1) { if (smp->dr_slid == IB_LID_PERMISSIVE) { /* giving SMP to SM - update hop_ptr */ - smp->hop_ptr--; + smp->hop_ptr--; return 1; } /* smp->hop_ptr updated when sending */ @@ -373,7 +373,7 @@ PCI_DMA_TODEVICE); gather_list.length = sizeof(struct ib_mad); gather_list.lkey = (*port_priv->mr).lkey; - + send_wr.next = NULL; send_wr.opcode = IB_WR_SEND; send_wr.sg_list = &gather_list; @@ -381,7 +381,7 @@ send_wr.wr.ud.remote_qpn = mad_recv_wc->wc->src_qp; /* DQPN */ send_wr.wr.ud.timeout_ms = 0; send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - + ah_attr.dlid = mad_recv_wc->wc->slid; ah_attr.port_num = mad_agent->port_num; ah_attr.src_path_bits = mad_recv_wc->wc->dlid_path_bits; @@ -410,7 +410,7 @@ kfree(agent_send_wr); goto out; } - + send_wr.wr.ud.ah = agent_send_wr->ah; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = mad_recv_wc->wc->pkey_index; @@ -560,8 +560,8 @@ { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; - struct list_head *send_wr; - unsigned long flags; + struct list_head *send_wr; + unsigned long flags; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); @@ -579,7 +579,7 @@ "is empty\n", (unsigned long long) mad_send_wc->wr_id); return; } - + agent_send_wr = list_entry(&port_priv->send_posted_list, struct ib_agent_send_wr, send_list); @@ -588,8 +588,8 @@ send_list); /* Remove from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, @@ -694,11 +694,11 @@ goto error3; } - /* Obtain MAD agent for PerfMgmt class */ - reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; clear_bit(IB_MGMT_METHOD_TRAP_REPRESS, (unsigned long *)®_req.method_mask); - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, ®_req, 0, &agent_send_handler, @@ -756,7 +756,7 @@ ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->lr_smp_agent); ib_unregister_mad_agent(port_priv->dr_smp_agent); - kfree(port_priv); + kfree(port_priv); return 0; } diff -ruN 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 10:42:29.000000000 -0800 +++ 2/mad.c 2004-11-03 10:43:12.000000000 -0800 @@ -1473,7 +1473,7 @@ struct ib_qp_attr *attr; int attr_mask; - attr = kmalloc(sizeof *attr, GFP_KERNEL); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); return -ENOMEM; From mashirle at us.ibm.com Wed Nov 3 10:56:29 2004 From: mashirle at us.ibm.com (Shirley Ma) 
Date: Wed, 3 Nov 2004 10:56:29 -0800 Subject: [openib-general] [PATCH] fix memory leak and return value associated with agent_mad_send(response) Message-ID: <200411031056.29522.mashirle@us.ibm.com> Here is the patch. Please review it. diff -urN access/agent.c access.patch6/agent.c --- access/agent.c 2004-11-03 10:34:17.941019320 -0800 +++ access.patch6/agent.c 2004-11-03 10:54:16.001886384 -0800 @@ -477,8 +477,10 @@ return 0; } if (agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc)) + NULL, mad_recv_wc)) { kfree(smp_response); + return 0; + } } else kfree(smp_response); return 1; @@ -504,7 +506,10 @@ ret = mad_process_local(mad_agent, mad, response, slid); if (ret & IB_MAD_RESULT_SUCCESS) { grh = (void *)mad - sizeof(struct ib_grh); - agent_mad_send(mad_agent, response, grh, mad_recv_wc); + if (agent_mad_send(mad_agent, response, grh, mad_recv_wc)) { + kfree(response); + return 0; + } } else kfree(response); return 1; @@ -543,12 +548,12 @@ } else { /* PerfMgmt class */ if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - agent_mad_response(mad_agent, mad, mad_recv_wc, - mad_recv_wc->wc->slid); + return (agent_mad_response(mad_agent, mad, mad_recv_wc, + mad_recv_wc->wc->slid)); } else { printk(KERN_ERR "agent_recv_mad: Unexpected mgmt class 0x%x received\n", mad->mad_hdr.mgmt_class); + return 0; } - return 0; } /* Complete receive up stack */ -- Thanks Shirley Ma IBM Linux Technology Center From iod00d at hp.com Wed Nov 3 11:21:03 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Nov 2004 11:21:03 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <523bzrshig.fsf@topspin.com> References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> Message-ID: <20041103192103.GF17281@cup.hp.com> On Tue, Nov 02, 2004 at 08:06:47PM -0800, Roland Dreier wrote: > CONFIG_INFINIBAND_CM depends on CONFIG_BROKEN now, ... > Your patch is a small step in the right direction so I applied it. "small" is a very generous assessment :^) It was almost irrelevant given how much code still needs work. Here's the link-phase output with CM/DM/SRP/etc. enabled: Building modules, stage 2. MODPOST *** Warning: "ib_client_query_cancel" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "tsIbSetOutofServiceNoticeHandler" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "tsIbPathRecordRequest" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "tsIbSetInServiceNoticeHandler" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "ib_client_mad_handler_register" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "tsIbPortInfoTblQuery" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "tsIbPortInfoQuery" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "ib_client_query" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "ib_client_alloc_tid" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "ib_mad_send" [drivers/infiniband/core/ib_cm.ko] undefined! *** Warning: "ib_mad_handler_register" [drivers/infiniband/core/ib_cm.ko] undefined! *** Warning: "ib_mad_handler_deregister" [drivers/infiniband/core/ib_cm.ko] undefined! Can folks offer some guidance on the following issues: 1) drivers/infiniband/include/ still has a lot of files prefixed with "ts". Do they all need to be renamed? Or do some need to be reworked to match some new interfaces? E.g. ib_client_query_cancel is declared in ts_ib_client_query.h.
I don't know if ts_ib_client_query.h needs additional work. Should I submit patches so all #include's only reference ib_client_query.h? Or maybe just client_query.h? 2) Can some of the offending include files be dropped outright? 3) Of the ts* symbols above, can someone point me at which header file contains the "right" interfaces to use? I might be able to fixup some of the warnings above. I'm thinking of tsIbSetOutofServiceNoticeHandler and similar functions. grant From roland at topspin.com Wed Nov 3 11:24:02 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 11:24:02 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <20041103192103.GF17281@cup.hp.com> (Grant Grundler's message of "Wed, 3 Nov 2004 11:21:03 -0800") References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> <20041103192103.GF17281@cup.hp.com> Message-ID: <528y9irb1p.fsf@topspin.com> Grant> 1) drivers/infiniband/include/ still has alot of files Grant> still prefixed with "ts". Do they all need to be renamed? Grant> Or do some need to be reworked to match some new Grant> interfaces? I think pretty much every ts_*.h file is obsolete. When we port the CM to the new world, we'll have to update ts_ib_cm.h but it needs quite a bit of work. We're going to create an actual gen2/trunk very soon with the first set of stuff for kernel submission, and I'll remove all the broken stuff from there (I'll keep the CM etc on my branch so that it can eventually be fixed and added to the trunk). - R. From roland at topspin.com Wed Nov 3 11:56:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 11:56:14 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52y8hjqwez.fsf@topspin.com> (Roland Dreier's message of "Tue, 02 Nov 2004 22:27:48 -0800") References: <52y8hjqwez.fsf@topspin.com> Message-ID: <524qk6r9k1.fsf@topspin.com> By the way, buried down at the end of the patch is some documentation about creating device files: +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" + + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 + for port 1 of device mthca0. Do the names /dev/infiniband/mthca0/umad1 and so on make sense to people? I thought that userspace verbs support would probably use a file like /dev/infiniband/mthca0/verbs, etc. In any case, now is probably the time to object before we have legacy issues to worry about.... - R. From iod00d at hp.com Wed Nov 3 12:07:24 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Nov 2004 12:07:24 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <528y9irb1p.fsf@topspin.com> References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> <20041103192103.GF17281@cup.hp.com> <528y9irb1p.fsf@topspin.com> Message-ID: <20041103200724.GG17281@cup.hp.com> On Wed, Nov 03, 2004 at 11:24:02AM -0800, Roland Dreier wrote: > Grant> 1) drivers/infiniband/include/ still has alot of files > Grant> still prefixed with "ts". Do they all need to be renamed? > Grant> Or do some need to be reworked to match some new > Grant> interfaces? > > I think pretty much every ts_*.h file is obsolete. When we port the > CM to the new world, we'll have to update ts_ib_cm.h but it needs > quite a bit of work. 
ok > We're going to create an actual gen2/trunk very soon with the first > set of stuff for kernel submission, and I'll remove all the broken > stuff from there (I'll keep the CM etc on my branch so that it can > eventually be fixed and added to the trunk). I don't mind broken stuff in the candidate branch. Especially if it's something that we need anyway and just needs "the dots connected". Fixing missing symbols is usually a pretty mundane task. It goes something along the lines of: 1) look for a similar new symbol 2) find the old symbol in the old code 3) compare the two and see how they differ. 4) either use the new symbol and adjust the code around the reference, drop the reference to the old symbol, or ask what to do. thanks, grant From krkumar at us.ibm.com Wed Nov 3 13:23:32 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 13:23:32 -0800 (PST) Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ Message-ID: Hi, I am not sure if this is a good idea, but since I am new to this area, here it goes :-) Section 11.2.6.3, C11-16 states that resize of the CQ must be permitted. In the patch I am submitting, I don't understand why so many parameters are expected by driver/verbs. I thought the qp_handle and ib_qp_attr are enough, at least according to the spec. Along with this, I am going to submit another patch to catch "catastrophic" errors in the return value of the resize operation. This is due to the need to check for two special cases: "CQ overrun" and "CQ inaccessible". For these two errors, I think the queues should be deallocated and an error returned. This is in the second patch. I am not sure of the error numbers; I guessed them from mthca_eq.c and could be wrong here. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 11:32:14.000000000 -0800 +++ 2/mad.c 2004-11-03 13:15:49.000000000 -0800 @@ -1629,6 +1629,14 @@ static void init_mad_queue(struct ib_mad INIT_LIST_HEAD(&mad_queue->list); } +/* + * Allocate one mad QP. + * + * If the return indicates success, the value returned is the new size + * of the queue pair that got created. + * + * Return > 0 on success and -(ERRNO) on failure. Zero should never happen. + */ static int create_mad_qp(struct ib_mad_port_private *port_priv, struct ib_mad_qp_info *qp_info, enum ib_qp_type qp_type) @@ -1652,15 +1660,23 @@ qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(qp_info->qp)) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", - get_spl_qp_index(qp_type)); + if (!IS_ERR(qp_info->qp)) { + struct ib_qp_attr qp_attr; + + ret = ib_query_qp(qp_info->qp, &qp_attr, 0, &qp_init_attr); + if (ret < 0) { + /* + * For any error, use the same size we used to
+ */ + ret = qp_init_attr.cap.max_send_wr + + qp_init_attr.cap.max_recv_wr; + } + } else { ret = PTR_ERR(qp_info->qp); - goto error; + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d err:%d\n", + get_spl_qp_index(qp_type), ret); } - return 0; - -error: return ret; } @@ -1682,6 +1698,7 @@ static int ib_mad_port_open(struct ib_de .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_mad_port_private *port_priv; + int total_qp_size; unsigned long flags; /* First, check if port already open at MAD layer */ @@ -1731,11 +1748,25 @@ static int ib_mad_port_open(struct ib_de } ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); - if (ret) + if (ret <= 0) goto error6; + total_qp_size = ret; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); - if (ret) + if (ret <= 0) goto error7; + total_qp_size += ret; + + /* Resize if the total QP[0,1] size is greater than CQ size. */ + if (total_qp_size > cq_size) { + printk(KERN_DEBUG PFX "ib_mad_port_open: increasing size of " + "CQ from %d to %d\n", cq_size, total_qp_size); + if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { + printk(KERN_DEBUG PFX "Couldn't increase CQ size - " + "err:%d\n", ret); + /* continue, not an error */ + } + } spin_lock_init(&port_priv->reg_lock); INIT_LIST_HEAD(&port_priv->agent_list); From krkumar at us.ibm.com Wed Nov 3 13:24:34 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 13:24:34 -0800 (PST) Subject: [openib-general] [PATCH 2/2] [RFC] Implement error handling in resize of CQ In-Reply-To: Message-ID: Again, this has been build-tested only. Thx, - KK diff -ruNp 2/mad.c 3/mad.c --- 2/mad.c 2004-11-03 13:15:49.000000000 -0800 +++ 3/mad.c 2004-11-03 13:16:47.000000000 -0800 @@ -1686,6 +1686,23 @@ static void destroy_mad_qp(struct ib_mad } /* + * Overrun and Inaccessible errors cannot be handled by QP resize operation. + */ +static inline int is_catastrophic_error(int err) +{ +#define CQ_OVERFLOW_ERROR 0x0f +#define CQ_ACCESS_ERROR 0x11 + + switch (err) { + default: /* OK */ + return 0; + case CQ_ACCESS_ERROR: + case CQ_OVERFLOW_ERROR: + return 1; + } +} + +/* * Open the port * Create the QP, PD, MR, and CQ if needed */ @@ -1764,6 +1781,10 @@ static int ib_mad_port_open(struct ib_de if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { printk(KERN_DEBUG PFX "Couldn't increase CQ size - " "err:%d\n", ret); + if (is_catastrophic_error(ret)) { + /* Clean up qp_info[0,1] */ + goto error8; + } /* continue, not an error */ } } From halr at voltaire.com Wed Nov 3 13:54:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 16:54:34 -0500 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) Message-ID: <1099518873.2837.5.camel@hpc-1> mthca/mad/agent changes to eliminate snoop_mad and use process_mad to give mthca driver first crack at received MAD and change agents to be send only (and not register for receives which are now handled by mthca). There are some more optimizations that can be made but this is a working start. I will do some of this hopefully tomorrow but outgoing SMI is my primary goal (to incorporate it into ib_post_send_mad). Once that is done, the changes to the MAD layer for SM support are usable AFAIK. 
Index: openib-candidate/src/linux-kernel/infiniband/access/agent.c =================================================================== --- openib-candidate/src/linux-kernel/infiniband/access/agent.c (revision 1125) +++ openib-candidate/src/linux-kernel/infiniband/access/agent.c (working copy) @@ -29,7 +29,6 @@ #include - static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); @@ -37,9 +36,9 @@ * Fixup a directed route SMP for sending. Return 0 if the SMP should be * discarded. */ -static int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) { u8 hop_ptr, hop_cnt; @@ -111,23 +110,6 @@ } /* - * Sender side handling of outgoing SMPs. Fixup the SMP as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_send(smp, node_type, port_num); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the SMP should be handled by the local SMA via process_mad. */ static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, @@ -145,10 +127,10 @@ * Adjust information for a received SMP. Return 0 if the SMP should be * dropped. */ -static int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) { u8 hop_ptr, hop_cnt; @@ -221,29 +203,10 @@ } /* - * Receive side handling SMPs. Save receive information as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_recv(smp, node_type, - port_num, phys_port_cnt); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the received DR SMP should be forwarded to the send queue. * Return 0 if the SMP should be completed up the stack. */ -static int smi_check_forward_dr_smp(struct ib_smp *smp) +int smi_check_forward_dr_smp(struct ib_smp *smp) { u8 hop_ptr, hop_cnt; @@ -274,31 +237,6 @@ return 0; } -/* - * Return 1 if the received SMP should be forwarded to the send queue. - * Return 0 if the SMP should be completed up the stack. 
- */ -static int smi_check_forward_smp(struct ib_smp *smp) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_check_forward_dr_smp(smp); - default: /* LR SM class */ - return 1; - } -} - -static int mad_process_local(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad *mad_response, - u16 slid) -{ - return mad_agent->device->process_mad(mad_agent->device, 0, - mad_agent->port_num, - slid, mad, mad_response); -} - static inline struct ib_agent_port_private * __ib_get_agent_mad(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -339,12 +277,28 @@ return entry; } -int agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) { struct ib_agent_port_private *port_priv; + + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_mad *mad, + struct ib_grh *grh, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; struct ib_sge gather_list; struct ib_send_wr send_wr; @@ -445,114 +399,41 @@ return ret; } -int smi_send_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid, - int phys_port_cnt) +int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) { - struct ib_mad *smp_response; - int ret; + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + struct ib_mad_recv_wc mad_recv_wc; - if (!smi_handle_smp_send(smp, mad_agent->device->node_type, - mad_agent->port_num)) { - /* SMI failed send */ - return 0; - } - - if (smi_check_local_smp(mad_agent, smp)) { - smp_response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!smp_response) - return 0; - - ret = mad_process_local(mad_agent, (struct ib_mad *)smp, - smp_response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (!smi_handle_smp_recv((struct ib_smp *)smp_response, - mad_agent->device->node_type, - mad_agent->port_num, - phys_port_cnt)) { - /* SMI failed receive */ - kfree(smp_response); - return 0; - } - if (agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc)) - kfree(smp_response); - } else - kfree(smp_response); + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); return 1; } - /* Post the send on the QP */ - return 1; -} - -int agent_mad_response(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid) -{ - struct ib_mad *response; - struct ib_grh *grh; - int ret; - - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!response) - return 0; - - ret = mad_process_local(mad_agent, mad, response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - grh = (void *)mad - sizeof(struct ib_grh); - agent_mad_send(mad_agent, response, grh, mad_recv_wc); - } else - kfree(response); - return 1; -} - -int agent_recv_mad(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - int phys_port_cnt) -{ - int port_num; - - /* SM Directed 
Route or LID Routed class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE || - mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) { - if (mad_agent->device->node_type != IB_NODE_SWITCH) - port_num = mad_agent->port_num; - else - port_num = mad_recv_wc->wc->port_num; - if (!smi_handle_smp_recv((struct ib_smp *)mad, - mad_agent->device->node_type, - port_num, phys_port_cnt)) { - /* SMI failed receive */ - return 0; - } - - if (smi_check_forward_smp((struct ib_smp *)mad)) { - smi_send_smp(mad_agent, - (struct ib_smp *)mad, - mad_recv_wc, - mad_recv_wc->wc->slid, - phys_port_cnt); - return 0; - } - - } else { - /* PerfMgmt class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - agent_mad_response(mad_agent, mad, mad_recv_wc, - mad_recv_wc->wc->slid); - } else { - printk(KERN_ERR "agent_recv_mad: Unexpected mgmt class 0x%x received\n", mad->mad_hdr.mgmt_class); - } - return 0; + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; } - /* Complete receive up stack */ - return 1; + /* Other fields don't matter so should change signature to just use wc */ + mad_recv_wc.wc = wc; + return agent_mad_send(mad_agent, mad, grh, &mad_recv_wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent, @@ -603,26 +484,6 @@ kfree(agent_send_wr->mad); } -static void agent_recv_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_recv_wc *mad_recv_wc) -{ - struct ib_agent_port_private *port_priv; - - /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_recv_handler: no matching MAD agent %p\n", - mad_agent); - } else { - agent_recv_mad(mad_agent, - mad_recv_wc->recv_buf->mad, - mad_recv_wc, port_priv->phys_port_cnt); - } - - /* Free received MAD */ - ib_free_recv_mad(mad_recv_wc); -} - int ib_agent_port_open(struct ib_device *device, int port_num, int phys_port_cnt) { @@ -663,19 +524,12 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; reg_req.mgmt_class_version = 1; - /* SMA needs to receive Get, Set, and TrapRepress methods */ - bitmap_zero((unsigned long *)®_req.method_mask, IB_MGMT_MAX_METHODS); - set_bit(IB_MGMT_METHOD_GET, (unsigned long *)®_req.method_mask); - set_bit(IB_MGMT_METHOD_SET, (unsigned long *)®_req.method_mask); - set_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); + if (IS_ERR(port_priv->dr_smp_agent)) { ret = PTR_ERR(port_priv->dr_smp_agent); goto error2; @@ -685,10 +539,9 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->lr_smp_agent)) { ret = PTR_ERR(port_priv->lr_smp_agent); goto error3; @@ -696,14 +549,11 @@ /* Obtain MAD agent for PerfMgmt class */ reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; - clear_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + 
port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->perf_mgmt_agent)) { ret = PTR_ERR(port_priv->perf_mgmt_agent); goto error4; Index: openib-candidate/src/linux-kernel/infiniband/access/mad.c =================================================================== --- openib-candidate/src/linux-kernel/infiniband/access/mad.c (revision 1124) +++ openib-candidate/src/linux-kernel/infiniband/access/mad.c (working copy) @@ -781,21 +781,13 @@ goto out; } version = port_priv->version[mad->mad_hdr.class_version]; - if (!version) { - printk(KERN_ERR PFX "MAD received on port %d for class " - "version %d with no client\n", - port_priv->port_num, mad->mad_hdr.class_version); + if (!version) goto out; - } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; if (class) mad_agent = class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; - else - printk(KERN_ERR PFX "MAD received on port %d for class " - "%d with no client\n", - port_priv->port_num, mad->mad_hdr.mgmt_class); } out: @@ -808,9 +800,7 @@ "%p on port %d\n", &mad_agent->agent, port_priv->port_num); } - } else - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); + } spin_unlock_irqrestore(&port_priv->reg_lock, flags); @@ -934,6 +924,23 @@ } } +extern int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); +extern int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -942,6 +949,7 @@ struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; + struct ib_smp *smp; int solicited; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; @@ -968,14 +976,69 @@ if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; - /* Snoop MAD ? */ - if (port_priv->device->snoop_mad) - if (port_priv->device->snoop_mad(port_priv->device, - (u8)port_priv->port_num, - wc->slid, - recv->header.recv_buf.mad)) + if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + smp = (struct ib_smp *)recv->header.recv_buf.mad; + if (!smi_handle_dr_smp_recv(smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) goto out; + if (!smi_check_forward_dr_smp(smp)) + goto out; + if (!smi_handle_dr_smp_send(smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(smp, + port_priv->device, + port_priv->port_num)) + goto out; + } + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + struct ib_mad *response; + struct ib_grh *grh; + int ret; + + response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* Is it better to assume that it wouldn't be processed ? 
*/ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + recv->header.recv_buf.mad, + response); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (response->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) { + kfree(response); + goto out; + } + } + /* Send response */ + grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); + if (agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) { + kfree(response); + goto out; + } + } else + kfree(response); + } + /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, @@ -1673,7 +1736,9 @@ * Open the port * Create the QP, PD, MR, and CQ if needed */ -static int ib_mad_port_open(struct ib_device *device, int port_num) +static int ib_mad_port_open(struct ib_device *device, + int port_num, + int num_ports) { int ret, cq_size; u64 iova = 0; @@ -1702,6 +1767,7 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; + port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1836,7 +1902,7 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port); + ret = ib_mad_port_open(device, cur_port, num_ports); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); Index: openib-candidate/src/linux-kernel/infiniband/access/mad_priv.h =================================================================== --- openib-candidate/src/linux-kernel/infiniband/access/mad_priv.h (revision 1119) +++ openib-candidate/src/linux-kernel/infiniband/access/mad_priv.h (working copy) @@ -156,6 +156,7 @@ struct list_head port_list; struct ib_device *device; int port_num; + int phys_port_cnt; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; Index: openib-candidate/src/linux-kernel/infiniband/include/ib_verbs.h =================================================================== --- openib-candidate/src/linux-kernel/infiniband/include/ib_verbs.h (revision 1108) +++ openib-candidate/src/linux-kernel/infiniband/include/ib_verbs.h (working copy) @@ -640,14 +640,10 @@ enum ib_mad_result { IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ - IB_MAD_RESULT_REPLY = 1 << 1 /* Reply packet needs to be sent */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ }; -enum ib_snoop_mad_result { - IB_SNOOP_MAD_IGNORED, - IB_SNOOP_MAD_CONSUMED -}; - #define IB_DEVICE_NAME_MAX 64 struct ib_device { Index: roland-merge/src/linux-kernel/infiniband/include/ib_verbs.h =================================================================== --- roland-merge/src/linux-kernel/infiniband/include/ib_verbs.h (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/include/ib_verbs.h (working copy) @@ -656,14 +656,10 @@ enum ib_mad_result { IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ - IB_MAD_RESULT_REPLY = 1 << 1 /* Reply packet needs to be sent */ + 
IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ }; -enum ib_snoop_mad_result { - IB_SNOOP_MAD_IGNORED, - IB_SNOOP_MAD_CONSUMED -}; - #define IB_DEVICE_NAME_MAX 64 struct ib_device { Index: roland-merge/src/linux-kernel/infiniband/core/agent.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/core/agent.c (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/core/agent.c (working copy) @@ -29,7 +29,6 @@ #include - static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); @@ -37,9 +36,9 @@ * Fixup a directed route SMP for sending. Return 0 if the SMP should be * discarded. */ -static int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) { u8 hop_ptr, hop_cnt; @@ -111,23 +110,6 @@ } /* - * Sender side handling of outgoing SMPs. Fixup the SMP as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_send(smp, node_type, port_num); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the SMP should be handled by the local SMA via process_mad. */ static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, @@ -145,10 +127,10 @@ * Adjust information for a received SMP. Return 0 if the SMP should be * dropped. */ -static int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) { u8 hop_ptr, hop_cnt; @@ -221,29 +203,10 @@ } /* - * Receive side handling SMPs. Save receive information as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_recv(smp, node_type, - port_num, phys_port_cnt); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the received DR SMP should be forwarded to the send queue. * Return 0 if the SMP should be completed up the stack. */ -static int smi_check_forward_dr_smp(struct ib_smp *smp) +int smi_check_forward_dr_smp(struct ib_smp *smp) { u8 hop_ptr, hop_cnt; @@ -274,31 +237,6 @@ return 0; } -/* - * Return 1 if the received SMP should be forwarded to the send queue. - * Return 0 if the SMP should be completed up the stack. 
- */ -static int smi_check_forward_smp(struct ib_smp *smp) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_check_forward_dr_smp(smp); - default: /* LR SM class */ - return 1; - } -} - -static int mad_process_local(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad *mad_response, - u16 slid) -{ - return mad_agent->device->process_mad(mad_agent->device, 0, - mad_agent->port_num, - slid, mad, mad_response); -} - static inline struct ib_agent_port_private * __ib_get_agent_mad(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -339,30 +277,47 @@ return entry; } -void agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) { struct ib_agent_port_private *port_priv; + + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_mad *mad, + struct ib_grh *grh, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; struct ib_sge gather_list; struct ib_send_wr send_wr; struct ib_send_wr *bad_send_wr; struct ib_ah_attr ah_attr; unsigned long flags; + int ret = 1; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", mad_agent); - return; + goto out; } agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); if (!agent_send_wr) - return; + goto out; agent_send_wr->mad = mad; /* PCI mapping */ @@ -406,8 +361,8 @@ agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); if (IS_ERR(agent_send_wr->ah)) { printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(mad); - return; + kfree(agent_send_wr); + goto out; } send_wr.wr.ud.ah = agent_send_wr->ah; @@ -432,120 +387,53 @@ sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); } else { list_add_tail(&agent_send_wr->send_list, &port_priv->send_posted_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; } + +out: + return ret; } -int smi_send_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid, - int phys_port_cnt) +int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) { - struct ib_mad *smp_response; - int ret; + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + struct ib_mad_recv_wc mad_recv_wc; - if (!smi_handle_smp_send(smp, mad_agent->device->node_type, - mad_agent->port_num)) { - /* SMI failed send */ - return 0; - } - - if (smi_check_local_smp(mad_agent, smp)) { - smp_response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!smp_response) - return 0; - - ret = mad_process_local(mad_agent, (struct ib_mad *)smp, - smp_response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (!smi_handle_smp_recv((struct ib_smp *)smp_response, - mad_agent->device->node_type, - mad_agent->port_num, - phys_port_cnt)) { - /* SMI failed receive */ - kfree(smp_response); - return 0; - } - agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc); - } 
else - kfree(smp_response); + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); return 1; } - /* Post the send on the QP */ - return 1; -} - -int agent_mad_response(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid) -{ - struct ib_mad *response; - struct ib_grh *grh; - int ret; - - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!response) - return 0; - - ret = mad_process_local(mad_agent, mad, response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - grh = (void *)mad - sizeof(struct ib_grh); - agent_mad_send(mad_agent, response, grh, mad_recv_wc); - } else - kfree(response); - return 1; -} - -int agent_recv_mad(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - int phys_port_cnt) -{ - int port_num; - - /* SM Directed Route or LID Routed class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE || - mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) { - if (mad_agent->device->node_type != IB_NODE_SWITCH) - port_num = mad_agent->port_num; - else - port_num = mad_recv_wc->wc->port_num; - if (!smi_handle_smp_recv((struct ib_smp *)mad, - mad_agent->device->node_type, - port_num, phys_port_cnt)) { - /* SMI failed receive */ - return 0; - } - - if (smi_check_forward_smp((struct ib_smp *)mad)) { - smi_send_smp(mad_agent, - (struct ib_smp *)mad, - mad_recv_wc, - mad_recv_wc->wc->slid, - phys_port_cnt); - return 0; - } - - } else { - /* PerfMgmt class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - agent_mad_response(mad_agent, mad, mad_recv_wc, - mad_recv_wc->wc->slid); - } else { - printk(KERN_ERR "agent_recv_mad: Unexpected mgmt class 0x%x received\n", mad->mad_hdr.mgmt_class); - } - return 0; + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; } - /* Complete receive up stack */ - return 1; + /* Other fields don't matter so should change signature to just use wc */ + mad_recv_wc.wc = wc; + return agent_mad_send(mad_agent, mad, grh, &mad_recv_wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent, @@ -596,26 +484,6 @@ kfree(agent_send_wr->mad); } -static void agent_recv_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_recv_wc *mad_recv_wc) -{ - struct ib_agent_port_private *port_priv; - - /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_recv_handler: no matching MAD agent %p\n", - mad_agent); - } else { - agent_recv_mad(mad_agent, - mad_recv_wc->recv_buf->mad, - mad_recv_wc, port_priv->phys_port_cnt); - } - - /* Free received MAD */ - ib_free_recv_mad(mad_recv_wc); -} - int ib_agent_port_open(struct ib_device *device, int port_num, int phys_port_cnt) { @@ -656,19 +524,12 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; reg_req.mgmt_class_version = 1; - /* SMA needs to receive Get, Set, and TrapRepress methods */ - bitmap_zero((unsigned long *)®_req.method_mask, IB_MGMT_MAX_METHODS); - set_bit(IB_MGMT_METHOD_GET, (unsigned long *)®_req.method_mask); - set_bit(IB_MGMT_METHOD_SET, (unsigned long *)®_req.method_mask); - 
set_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); + if (IS_ERR(port_priv->dr_smp_agent)) { ret = PTR_ERR(port_priv->dr_smp_agent); goto error2; @@ -678,10 +539,9 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->lr_smp_agent)) { ret = PTR_ERR(port_priv->lr_smp_agent); goto error3; @@ -689,14 +549,11 @@ /* Obtain MAD agent for PerfMgmt class */ reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; - clear_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->perf_mgmt_agent)) { ret = PTR_ERR(port_priv->perf_mgmt_agent); goto error4; Index: roland-merge/src/linux-kernel/infiniband/core/mad.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/core/mad.c (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/core/mad.c (working copy) @@ -747,13 +747,16 @@ struct ib_mad *mad, int solicited) { - struct ib_mad_agent_private *entry, *mad_agent = NULL; - struct ib_mad_mgmt_class_table *version; - struct ib_mad_mgmt_method_table *class; - u32 hi_tid; + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + spin_lock_irqsave(&port_priv->reg_lock, flags); + /* Whether MAD was solicited determines type of routing to MAD client */ if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + /* Routing is based on high 32 bits of transaction ID of MAD */ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; list_for_each_entry(entry, &port_priv->agent_list, agent_list) { @@ -762,12 +765,14 @@ break; } } - if (!mad_agent) { + if (!mad_agent) printk(KERN_ERR PFX "No client 0x%x for received MAD " - "on port %d\n", hi_tid, port_priv->port_num); - goto out; - } + "on port %d\n", + hi_tid, port_priv->port_num); } else { + struct ib_mad_mgmt_class_table *version; + struct ib_mad_mgmt_method_table *class; + /* Routing is based on version, class, and method */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { printk(KERN_ERR PFX "MAD received with unsupported " @@ -776,32 +781,29 @@ goto out; } version = port_priv->version[mad->mad_hdr.class_version]; - if (!version) { - printk(KERN_ERR PFX "MAD received on port %d for class " - "version %d with no client\n", - port_priv->port_num, mad->mad_hdr.class_version); + if (!version) goto out; - } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; - if (!class) { - printk(KERN_ERR PFX "MAD received on port %d for class " - "%d with no client\n", - port_priv->port_num, mad->mad_hdr.mgmt_class); - goto out; - } - mad_agent = class->agent[mad->mad_hdr.method & - ~IB_MGMT_METHOD_RESP]; + if (class) + mad_agent = class->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; } out: - if (mad_agent && !mad_agent->agent.recv_handler) { - printk(KERN_ERR PFX "No receive handler for client " - "%p on port %d\n", - &mad_agent->agent, port_priv->port_num); - mad_agent = NULL; + if (mad_agent) { + if 
(mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + mad_agent = NULL; + printk(KERN_ERR PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + } } + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + return mad_agent; } @@ -922,6 +924,23 @@ } } +extern int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); +extern int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -929,9 +948,9 @@ struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; - struct ib_mad_agent_private *mad_agent = NULL; + struct ib_mad_agent_private *mad_agent; + struct ib_smp *smp; int solicited; - unsigned long flags; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; @@ -957,31 +976,80 @@ if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; - /* Snoop MAD ? */ - if (port_priv->device->snoop_mad) - if (port_priv->device->snoop_mad(port_priv->device, - (u8)port_priv->port_num, - wc->slid, - recv->header.recv_buf.mad)) + if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + smp = (struct ib_smp *)recv->header.recv_buf.mad; + if (!smi_handle_dr_smp_recv(smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) goto out; + if (!smi_check_forward_dr_smp(smp)) + goto out; + if (!smi_handle_dr_smp_send(smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(smp, + port_priv->device, + port_priv->port_num)) + goto out; + } - spin_lock_irqsave(&port_priv->reg_lock, flags); + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + struct ib_mad *response; + struct ib_grh *grh; + int ret; + + response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* Is it better to assume that it wouldn't be processed ? 
*/ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + recv->header.recv_buf.mad, + response); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (response->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) { + kfree(response); + goto out; + } + } + /* Send response */ + grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); + if (agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) { + kfree(response); + goto out; + } + } else + kfree(response); + } + /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); - if (!mad_agent) { - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); - } else { - atomic_inc(&mad_agent->refcount); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); + if (mad_agent) { ib_mad_complete_recv(mad_agent, recv, solicited); + recv = NULL; /* recv is freed up via ib_mad_complete_recv */ } out: - if (!mad_agent) { + if (recv) { /* Should this case be optimized ? */ kmem_cache_free(ib_mad_cache, recv); } @@ -1668,7 +1736,9 @@ * Open the port * Create the QP, PD, MR, and CQ if needed */ -static int ib_mad_port_open(struct ib_device *device, int port_num) +static int ib_mad_port_open(struct ib_device *device, + int port_num, + int num_ports) { int ret, cq_size; u64 iova = 0; @@ -1697,6 +1767,7 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; + port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1831,7 +1902,7 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port); + ret = ib_mad_port_open(device, cur_port, num_ports); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); Index: roland-merge/src/linux-kernel/infiniband/core/mad_priv.h =================================================================== --- roland-merge/src/linux-kernel/infiniband/core/mad_priv.h (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/core/mad_priv.h (working copy) @@ -156,6 +156,7 @@ struct list_head port_list; struct ib_device *device; int port_num; + int phys_port_cnt; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; Index: roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -349,10 +349,6 @@ u16 slid, struct ib_mad *in_mad, struct ib_mad *out_mad); -enum ib_snoop_mad_result mthca_snoop_mad(struct ib_device *ibdev, - u8 port_num, - u16 slid, - struct ib_mad *mad); static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) { Index: roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_provider.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 1125) +++ 
roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -600,7 +600,6 @@ dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; - dev->ib_dev.snoop_mad = mthca_snoop_mad; ret = ib_register_device(&dev->ib_dev); if (ret) Index: roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c (working copy) @@ -79,6 +79,16 @@ int err; u8 status; + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED && + in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + + /* XXX: forward locally generated MAD to SM */ + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + /* * Only handle SM gets, sets and trap represses for SM class * @@ -137,21 +147,6 @@ return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; } -enum ib_snoop_mad_result mthca_snoop_mad(struct ib_device *ibdev, - u8 port_num, - u16 slid, - struct ib_mad *mad) -{ - if (mad->mad_hdr.method != IB_MGMT_METHOD_TRAP || - mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE || - slid != 0) - return IB_SNOOP_MAD_IGNORED; - - /* XXX: forward locally generated MAD to SM */ - - return IB_SNOOP_MAD_CONSUMED; -} - /* * Local Variables: * c-file-style: "linux" From halr at voltaire.com Wed Nov 3 13:59:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 16:59:21 -0500 Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: References: Message-ID: <1099519161.2837.8.camel@hpc-1> On Wed, 2004-11-03 at 13:45, Krishna Kumar wrote: > Entire openib cleaned up to remove 8 spaces to replace with > tabs, just two files though :-) Any chance I could get you to regenerate this patch with the latest code ? I just made a major change to both mad.c and agent.c so this doesn't apply too easily and I'm not sure I could manually fix it right now. Thanks in advance. -- Hal From krkumar at us.ibm.com Wed Nov 3 13:49:17 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 13:49:17 -0800 (PST) Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: <1099518873.2837.5.camel@hpc-1> Message-ID: Hal, Sure, I will regenerate this patch and send in about an hour's time. - KK From halr at voltaire.com Wed Nov 3 14:02:09 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 17:02:09 -0500 Subject: [openib-general] [PATCH] fix memory leak and return value associated with agent_mad_send(response) In-Reply-To: <200411031056.29522.mashirle@us.ibm.com> References: <200411031056.29522.mashirle@us.ibm.com> Message-ID: <1099519329.2837.12.camel@hpc-1> On Wed, 2004-11-03 at 13:56, Shirley Ma wrote: > Here is the patch. Please review it. As I just made a major change to both mad.c and agent.c which changed how this works. Could I prevail on you to review the latest and provide a patch to that ? Thanks. 
-- Hal From mshefty at ichips.intel.com Wed Nov 3 14:33:58 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 14:33:58 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <41895CD6.4020807@ichips.intel.com> Krishna Kumar wrote: > Hi, > > I am not sure if this is a good idea, but since I am new to this area, > here it goes :-) I think that the idea is valid. :) > Section 11.2.6.3, C11-16 states that resize of qp must be permitted. > In the patch I am submitting, I don't understand why so many parameters > are expected by driver/verbs. I thought the qp_handle and ib_qp_attr is > enough, at least according to the spec. I didn't follow what you were trying to reference here. Are you referring to the QP or CQ? > Along with this, I am going to submit another patch to catch "catastrophic" > errors in return value of the resize operation. This is due to the need > to check for 2 special cases : "CQ overrun" and "CQ inaccessible". For > these two errors, I think the queues should be deallocated and error > returned. This is in the second patch. I am not sure of the error numbers, > I guessed it from mthca_eq.c and could be wrong here. I'm adding in code to handle QP errors and overrun. If we are unable to resize the CQ, we can prevent CQ overrun by limiting the number of work requests posted to the corresponding QPs, rather than completely disabling the port. I'll have a better idea of what we can do in this case when I get more of the code in place. From mshefty at ichips.intel.com Wed Nov 3 14:39:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 14:39:03 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <524qk6r9k1.fsf@topspin.com> References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> Message-ID: <41895E07.6080804@ichips.intel.com> Roland Dreier wrote: > Do the names /dev/infiniband/mthca0/umad1 and so on make sense to > people? I thought that userspace verbs support would probably use a > file like /dev/infiniband/mthca0/verbs, etc. I think that this approach is good. - Sean From johannes at erdfelt.com Wed Nov 3 14:43:37 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Wed, 3 Nov 2004 14:43:37 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <524qk6r9k1.fsf@topspin.com> References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> Message-ID: <20041103224337.GS17669@sventech.com> On Wed, Nov 03, 2004, Roland Dreier wrote: > By the way, buried down at the end of the patch is some documentation > about creating device files: > > +/dev files > + > + To create the appropriate character device files automatically with > + udev, a rule like > + > + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" > + > + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 > + for port 1 of device mthca0. > > Do the names /dev/infiniband/mthca0/umad1 and so on make sense to > people? I thought that userspace verbs support would probably use a > file like /dev/infiniband/mthca0/verbs, etc. > > In any case, now is probably the time to object before we have legacy > issues to worry about.... Does the device name need to have the HCA driver name in it? Also, the u in umad is implied.
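For example, dropping the implied "u" would only mean changing the NAME half of the udev rule quoted above (a hypothetical variant; the kernel device itself would still be matched as umad*):

    KERNEL="umad*", NAME="infiniband/%s{ibdev}/mad%s{port}"

which would create nodes like /dev/infiniband/mthca0/mad1 for port 1 of mthca0.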
Wouldn't it be more appropriate to do something like this: /dev/infiniband/hca0/mad1 or maybe even: /dev/ib/hca0/mad1 JE From xma at us.ibm.com Wed Nov 3 15:00:13 2004 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 3 Nov 2004 15:00:13 -0800 Subject: [openib-general] [PATCH] fix memory leak and return value associated with agent_mad_send(response) In-Reply-To: <1099519329.2837.12.camel@hpc-1> Message-ID: > As I just made a major change to both mad.c and agent.c which changed how this works. Could I prevail on you to review the latest and provide a patch to that ? I will take a look at the most recent bit. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From krkumar at us.ibm.com Wed Nov 3 15:52:16 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 15:52:16 -0800 (PST) Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: <1099519161.2837.8.camel@hpc-1> Message-ID: Hi Hal, The same patch on latest bits .... thx, - KK diff -ruNp 1/agent.c 2/agent.c --- 1/agent.c 2004-11-03 15:50:04.000000000 -0800 +++ 2/agent.c 2004-11-03 15:50:47.000000000 -0800 @@ -189,7 +189,7 @@ int smi_handle_dr_smp_recv(struct ib_smp if (hop_ptr == 1) { if (smp->dr_slid == IB_LID_PERMISSIVE) { /* giving SMP to SM - update hop_ptr */ - smp->hop_ptr--; + smp->hop_ptr--; return 1; } /* smp->hop_ptr updated when sending */ @@ -327,7 +327,7 @@ static int agent_mad_send(struct ib_mad_ PCI_DMA_TODEVICE); gather_list.length = sizeof(struct ib_mad); gather_list.lkey = (*port_priv->mr).lkey; - + send_wr.next = NULL; send_wr.opcode = IB_WR_SEND; send_wr.sg_list = &gather_list; @@ -335,7 +335,7 @@ static int agent_mad_send(struct ib_mad_ send_wr.wr.ud.remote_qpn = mad_recv_wc->wc->src_qp; /* DQPN */ send_wr.wr.ud.timeout_ms = 0; send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - + ah_attr.dlid = mad_recv_wc->wc->slid; ah_attr.port_num = mad_agent->port_num; ah_attr.src_path_bits = mad_recv_wc->wc->dlid_path_bits; @@ -364,7 +364,7 @@ static int agent_mad_send(struct ib_mad_ kfree(agent_send_wr); goto out; } - + send_wr.wr.ud.ah = agent_send_wr->ah; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = mad_recv_wc->wc->pkey_index; @@ -441,8 +441,8 @@ static void agent_send_handler(struct ib { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; - struct list_head *send_wr; - unsigned long flags; + struct list_head *send_wr; + unsigned long flags; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); @@ -460,7 +460,7 @@ static void agent_send_handler(struct ib "is empty\n", (unsigned long long) mad_send_wc->wr_id); return; } - + agent_send_wr = list_entry(&port_priv->send_posted_list, struct ib_agent_send_wr, send_list); @@ -469,8 +469,8 @@ static void agent_send_handler(struct ib send_list); /* Remove from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, @@ -547,8 +547,8 @@ int ib_agent_port_open(struct ib_device goto error3; } - /* Obtain MAD agent for PerfMgmt class */ - reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; 
port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, NULL, 0, @@ -606,7 +606,7 @@ int ib_agent_port_close(struct ib_device ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->lr_smp_agent); ib_unregister_mad_agent(port_priv->dr_smp_agent); - kfree(port_priv); + kfree(port_priv); return 0; } diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 15:50:04.000000000 -0800 +++ 2/mad.c 2004-11-03 15:50:50.000000000 -0800 @@ -1536,7 +1536,7 @@ static inline int ib_mad_change_qp_state struct ib_qp_attr *attr; int attr_mask; - attr = kmalloc(sizeof *attr, GFP_KERNEL); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); return -ENOMEM; On Wed, 3 Nov 2004, Hal Rosenstock wrote: > On Wed, 2004-11-03 at 13:45, Krishna Kumar wrote: > > Entire openib cleaned up to remove 8 spaces to replace with > > tabs, just two files though :-) > > Any chance I could get you to regenerate this patch with the latest code > ? I just made a major change to both mad.c and agent.c so this doesn't > apply too easily and I'm not sure I could manually fix it right now. > > Thanks in advance. > > -- Hal > > > From mshefty at ichips.intel.com Wed Nov 3 16:13:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 16:13:27 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <41897427.4060604@ichips.intel.com> Krishna Kumar wrote: > qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); > - if (IS_ERR(qp_info->qp)) { > - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", > - get_spl_qp_index(qp_type)); > + if (!IS_ERR(qp_info->qp)) { > + struct ib_qp_attr qp_attr; > + > + ret = ib_query_qp(qp_info->qp, &qp_attr, 0, &qp_init_attr); Note that the qp_init_attr parameter passed into ib_create_qp should return the actual size of the QP that was created. The call to ib_query_qp shouldn't be needed. - Sean From krkumar at us.ibm.com Wed Nov 3 16:27:59 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 16:27:59 -0800 (PST) Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <41895CD6.4020807@ichips.intel.com> Message-ID: On Wed, 3 Nov 2004, Sean Hefty wrote: > I didn't follow what you were trying to reference here. Are you > referring to the QP or CQ? QP. When I do a query for the QP, all I really need is the qp ptr and the qp_attr structure to fill in values. What I didn't figure out is why an attr_mask and ib_qp_init_attr is needed. BTW, I had thought that ib_qp_init_attr was used for initialization type of attributes, exactly once the device is passed init attributes, then onwards ib_qp_attr should be used. So ib_qp_init_attr seems redundant. Or I have understood the code wrong. > I'm adding in code to handle QP errors and overrun. If we are unable to > resize the CQ, we can prevent CQ overrun by limited the number of work > requests posted to the corresponding QPs, rather than completely Actually I read it wrong in this case, probably the code needs to check only for "inaccessible" which is a critical error since the CEQ cannot be posted to the CQ even though the CQ is not full. If you are not already adding the exact same functionality, please let me know if the following looks correct. I recreated both patches after Hal's checkin (Patch1 and Patch2 below). 
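For comparison, the no-query version Sean describes would reduce the relevant part of create_mad_qp() to roughly this (a sketch only, not tested; it assumes the driver writes the actual sizes back into qp_init_attr.cap on return, which is what the verbs definition of create QP expects, and it elides the rest of the existing setup):

	struct ib_qp_init_attr qp_init_attr;
	int qp_size;

	memset(&qp_init_attr, 0, sizeof qp_init_attr);
	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
	/* ... qp_type, port_num, CQs, etc. as create_mad_qp() already sets ... */

	qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr);
	if (IS_ERR(qp_info->qp))
		return PTR_ERR(qp_info->qp);

	/*
	 * If the driver updates qp_init_attr.cap in place on return, the
	 * actual queue sizes can be read back directly, with no
	 * ib_query_qp() round trip.
	 */
	qp_size = qp_init_attr.cap.max_send_wr + qp_init_attr.cap.max_recv_wr;
	return qp_size;

PATCH1 below keeps the ib_query_qp() call instead, falling back to the create-time sizes on error.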
Also, I saw your other mail, and I had looked at the driver, and it didn't modify the final size of the new QP in the init_attr. It used the structure to do its work but didn't update it. I was initially planning on not using query() and instead relying on this structure getting updated. The verb interface cannot do it since the qp doesn't contain the size. We cannot change the driver to change the init structure since potentially other drivers may not do it, hence the reason to do a query to figure out the correct size. verb create_qp(): if (!IS_ERR(qp)) { qp->device = pd->device; qp->pd = pd; qp->send_cq = qp_init_attr->send_cq; qp->recv_cq = qp_init_attr->recv_cq; qp->srq = qp_init_attr->srq; qp->qp_context = qp_init_attr->qp_context; atomic_inc(&pd->usecnt); atomic_inc(&qp_init_attr->send_cq->usecnt); atomic_inc(&qp_init_attr->recv_cq->usecnt); if (qp_init_attr->srq) atomic_inc(&qp_init_attr->srq->usecnt); } driver create_qp(): case IB_QPT_SMI: case IB_QPT_GSI: { qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); if (!qp) return ERR_PTR(-ENOMEM); qp->sq.max = init_attr->cap.max_send_wr; qp->rq.max = init_attr->cap.max_recv_wr; qp->sq.max_gs = init_attr->cap.max_send_sge; qp->rq.max_gs = init_attr->cap.max_recv_sge; qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0:1; err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), to_mcq(init_attr->send_cq), to_mcq(init_attr->recv_cq), init_attr->sq_sig_type, init_attr->rq_sig_type, qp->ibqp.qp_num, init_attr->port_num, to_msqp(qp)); break; } thanks, - KK -------------------------------------------------------------------------- PATCH1 -------------------------------------------------------------------------- diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 16:03:25.000000000 -0800 +++ 2/mad.c 2004-11-03 16:03:43.000000000 -0800 @@ -1692,6 +1692,14 @@ static void init_mad_queue(struct ib_mad INIT_LIST_HEAD(&mad_queue->list); } +/* + * Allocate one mad QP. + * + * If the return indicates success, the value returned is the new size + * of the queue pair that got created. + * + * Return > 0 on success and -(ERRNO) on failure. Zero should never happen. + */ static int create_mad_qp(struct ib_mad_port_private *port_priv, struct ib_mad_qp_info *qp_info, enum ib_qp_type qp_type) @@ -1715,15 +1723,23 @@ qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(qp_info->qp)) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", - get_spl_qp_index(qp_type)); + if (!IS_ERR(qp_info->qp)) { + struct ib_qp_attr qp_attr; + + ret = ib_query_qp(qp_info->qp, &qp_attr, 0, &qp_init_attr); + if (ret < 0) { + /* + * For any error, use the same size we used to + * create the queue.
+ */ + ret = qp_init_attr.cap.max_send_wr + + qp_init_attr.cap.max_recv_wr; + } + } else { ret = PTR_ERR(qp_info->qp); - goto error; + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d err:%d\n", + get_spl_qp_index(qp_type), ret); } - return 0; - -error: return ret; } @@ -1747,6 +1763,7 @@ static int ib_mad_port_open(struct ib_de .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_mad_port_private *port_priv; + int total_qp_size; unsigned long flags; /* First, check if port already open at MAD layer */ @@ -1797,11 +1814,25 @@ static int ib_mad_port_open(struct ib_de } ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); - if (ret) + if (ret <= 0) goto error6; + total_qp_size = ret; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); - if (ret) + if (ret <= 0) goto error7; + total_qp_size += ret; + + /* Resize if the total QP[0,1] size is greater than CQ size. */ + if (total_qp_size > cq_size) { + printk(KERN_DEBUG PFX "ib_mad_port_open: increasing size of " + "CQ from %d to %d\n", cq_size, total_qp_size); + if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { + printk(KERN_DEBUG PFX "Couldn't increase CQ size - " + "err:%d\n", ret); + /* continue, not an error */ + } + } spin_lock_init(&port_priv->reg_lock); INIT_LIST_HEAD(&port_priv->agent_list); ---------------------------------------------------------------------------- PATCH2 ---------------------------------------------------------------------------- diff -ruNp 2/mad.c 3/mad.c --- 2/mad.c 2004-11-03 16:03:43.000000000 -0800 +++ 3/mad.c 2004-11-03 16:17:54.000000000 -0800 @@ -1749,6 +1749,21 @@ static void destroy_mad_qp(struct ib_mad } /* + * Overrun and Inaccessible errors cannot be handled by QP resize operation. + */ +static inline int is_catastrophic_error(int err) +{ +#define CQ_ACCESS_ERROR 0x11 + + switch (err) { + default: /* OK */ + return 0; + case CQ_ACCESS_ERROR: + return 1; + } +} + +/* * Open the port * Create the QP, PD, MR, and CQ if needed */ @@ -1830,6 +1845,10 @@ static int ib_mad_port_open(struct ib_de if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { printk(KERN_DEBUG PFX "Couldn't increase CQ size - " "err:%d\n", ret); + if (is_catastrophic_error(ret)) { + /* Clean up qp_info[0,1] */ + goto error8; + } /* continue, not an error */ } } From mshefty at ichips.intel.com Wed Nov 3 16:54:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 16:54:46 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <41897DD6.8060100@ichips.intel.com> Krishna Kumar wrote: > QP. When I do a query for the QP, all I really need is the qp ptr and > the qp_attr structure to fill in values. What I didn't figure out is > why an attr_mask and ib_qp_init_attr is needed. BTW, I had thought that > ib_qp_init_attr was used for initialization type of attributes, exactly > once the device is passed init attributes, then onwards ib_qp_attr should > be used. So ib_qp_init_attr seems redundant. Or I have understood the > code wrong. The mask allows the query to be a little more selective about what data it is trying to access, which can potentially avoid accessing the hardware. The qp_attr and qp_init_attr contain different data, so both are returned from the query call. To have ib_query_qp return only qp_attr, we would need to add the fields from qp_init_attr to it. > If you are not already adding the exact same functionality, please let me > know if the following looks correct. 
I recreated both patches after Hal's > checkin (Patch1 and Patch2 below). I am not adding this same functionality, and I'm coding around where your patch would go. > Also, I saw your other mail, and I had looked at the driver and it > didn't modify the final size of the new QP in the init_attr. It used the > structure to do it's work but doesn't update it. I was initially planning > on not using query() and instead rely on this structure getting updated. > The verb interface cannot do it since it qp doesn't contain the size. We > cannot change the driver to change the init structure since potentially > other drivers may not do it, so the reason to do a query to figure the > correct size. The original call to ib_create_qp took a third parameter, a qp_cap structure, for output. This structure contained the actual QP settings returned from the ib_create_qp call. I assumed that by removing this parameter, the capabilities would be returned directly in the qp_init_attr structure. If this is not the case, then the driver should probably change to do that. This matches what is defined by verbs, so I think that it's safe to do it. > qp->sq.max = init_attr->cap.max_send_wr; > qp->rq.max = init_attr->cap.max_recv_wr; > qp->sq.max_gs = init_attr->cap.max_send_sge; > qp->rq.max_gs = init_attr->cap.max_recv_sge; > > err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), I haven't looked at the mthca_alloc_sqp call in more detail, but if it doesn't create a QP larger than that specified, then it wouldn't need to change the qp_cap fields. From krkumar at us.ibm.com Wed Nov 3 17:33:36 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 17:33:36 -0800 (PST) Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <41897DD6.8060100@ichips.intel.com> Message-ID: Hi Sean, I just checked the spec for create_qp on pg567 and found that it expects the verb/driver to modify this field, as you indicated. So I will go ahead and submit a new patch to fix this. Thanks for your input, - KK On Wed, 3 Nov 2004, Sean Hefty wrote: > Krishna Kumar wrote: > > QP. When I do a query for the QP, all I really need is the qp ptr and > > the qp_attr structure to fill in values. What I didn't figure out is > > why an attr_mask and ib_qp_init_attr is needed. BTW, I had thought that > > ib_qp_init_attr was used for initialization type of attributes, exactly > > once the device is passed init attributes, then onwards ib_qp_attr should > > be used. So ib_qp_init_attr seems redundant. Or I have understood the > > code wrong. > > The mask allows the query to be a little more selective about what data > it is trying to access, which can potentially avoid accessing the hardware. > > The qp_attr and qp_init_attr contain different data, so both are > returned from the query call. To have ib_query_qp return only qp_attr, > we would need to add the fields from qp_init_attr to it. > > > If you are not already adding the exact same functionality, please let me > > know if the following looks correct. I recreated both patches after Hal's > > checkin (Patch1 and Patch2 below). > > I am not adding this same functionality, and I'm coding around where > your patch would go. > > > Also, I saw your other mail, and I had looked at the driver and it > > didn't modify the final size of the new QP in the init_attr. It used the > > structure to do it's work but doesn't update it. I was initially planning > > on not using query() and instead rely on this structure getting updated. 
> > The verb interface cannot do it since it qp doesn't contain the size. We > > cannot change the driver to change the init structure since potentially > > other drivers may not do it, so the reason to do a query to figure the > > correct size. > > The original call to ib_create_qp took a third parameter, a qp_cap > structure, for output. This structure contained the actual QP settings > returned from the ib_create_qp call. I assumed that by removing this > parameter, the capabilities would be returned directly in the > qp_init_attr structure. If this is not the case, then the driver should > probably change to do that. This matches what is defined by verbs, so > I think that it's safe to do it. > > > qp->sq.max = init_attr->cap.max_send_wr; > > qp->rq.max = init_attr->cap.max_recv_wr; > > qp->sq.max_gs = init_attr->cap.max_send_sge; > > qp->rq.max_gs = init_attr->cap.max_recv_sge; > > > > err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), > > I haven't looked at the mthca_alloc_sqp call in more detail, but if it > doesn't create a QP larger than that specified, then it wouldn't need to > change the qp_cap fields. > > > From krkumar at us.ibm.com Wed Nov 3 17:44:01 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 17:44:01 -0800 (PST) Subject: [openib-general] [PATCH 1/2] Resize CQ Message-ID: This is after incorporating feedback from Sean. Compiles cleanly. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 16:03:25.000000000 -0800 +++ 2/mad.c 2004-11-03 17:37:31.000000000 -0800 @@ -1692,6 +1692,14 @@ static void init_mad_queue(struct ib_mad INIT_LIST_HEAD(&mad_queue->list); } +/* + * Allocate one mad QP. + * + * If the return indicates success, the value returned is the new size + * of the queue pair that got created. + * + * Return > 0 on success and -(ERRNO) on failure. Zero should never happen. + */ static int create_mad_qp(struct ib_mad_port_private *port_priv, struct ib_mad_qp_info *qp_info, enum ib_qp_type qp_type) @@ -1715,15 +1723,18 @@ static int create_mad_qp(struct ib_mad_p qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(qp_info->qp)) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", - get_spl_qp_index(qp_type)); + if (!IS_ERR(qp_info->qp)) { + /* + * Driver should have modified the cap max_* fields + * if it increased the qp send/recv size. + */ + ret = qp_init_attr.cap.max_send_wr + + qp_init_attr.cap.max_recv_wr; + } else { ret = PTR_ERR(qp_info->qp); - goto error; + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d err:%d\n", + get_spl_qp_index(qp_type), ret); } - return 0; - -error: return ret; } @@ -1747,6 +1758,7 @@ static int ib_mad_port_open(struct ib_de .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_mad_port_private *port_priv; + int total_qp_size; unsigned long flags; /* First, check if port already open at MAD layer */ @@ -1797,11 +1809,25 @@ static int ib_mad_port_open(struct ib_de } ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); - if (ret) + if (ret <= 0) goto error6; + total_qp_size = ret; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); - if (ret) + if (ret <= 0) goto error7; + total_qp_size += ret; + + /* Resize if the total size of QP[0,1] is greater than CQ size. 
*/ + if (total_qp_size > cq_size) { + printk(KERN_DEBUG PFX "ib_mad_port_open: Increasing size of " + "CQ from %d to %d\n", cq_size, total_qp_size); + if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { + printk(KERN_DEBUG PFX "Couldn't increase CQ size - " + "err:%d\n", ret); + /* continue, not an error */ + } + } spin_lock_init(&port_priv->reg_lock); INIT_LIST_HEAD(&port_priv->agent_list); From krkumar at us.ibm.com Wed Nov 3 17:48:52 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 17:48:52 -0800 (PST) Subject: [openib-general] [PATCH 2/2] Implement error handling in resize failure. Message-ID: The only issue is whether the code below for CQ_ACCESS_ERROR is correct. I have taken it from mthca_eq.c : MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11 thanks, - KK diff -ruNp 2/mad.c 3/mad.c --- 2/mad.c 2004-11-03 17:37:31.000000000 -0800 +++ 3/mad.c 2004-11-03 17:38:40.000000000 -0800 @@ -1744,6 +1744,21 @@ static void destroy_mad_qp(struct ib_mad } /* + * "Inaccessible" error cannot be handled by QP resize operation. + */ +static inline int is_catastrophic_error(int err) +{ +#define CQ_ACCESS_ERROR 0x11 + + switch (err) { + default: /* OK */ + return 0; + case CQ_ACCESS_ERROR: + return 1; + } +} + +/* * Open the port * Create the QP, PD, MR, and CQ if needed */ @@ -1825,6 +1840,10 @@ static int ib_mad_port_open(struct ib_de if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { printk(KERN_DEBUG PFX "Couldn't increase CQ size - " "err:%d\n", ret); + if (is_catastrophic_error(ret)) { + /* Clean up qp[0,1] */ + goto error8; + } /* continue, not an error */ } } From krkumar at us.ibm.com Wed Nov 3 18:23:58 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 18:23:58 -0800 (PST) Subject: [openib-general] [PATCH] Reorganize and clean up debug messages in find_mad_agent() In-Reply-To: Message-ID: Now messages are printed if either : 1. mad_agent is not found. 2. mad_agent is found but doesn't have a handler. Printing messages during errors in the process of finding the mad_agent has been removed. Thanks, - KK On Wed, 3 Nov 2004, Sean Hefty wrote: > Thanks for the patch. If you can do something with the printk's, that > would be good. They should be KERN_NOTICE, but we may want to consider > just removing them. diff -ruNp 5/mad.c 6/mad.c --- 5/mad.c 2004-11-03 17:56:54.000000000 -0800 +++ 6/mad.c 2004-11-03 18:17:04.000000000 -0800 @@ -752,34 +752,33 @@ find_mad_agent(struct ib_mad_port_privat spin_lock_irqsave(&port_priv->reg_lock, flags); - /* Whether MAD was solicited determines type of routing to MAD client */ + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ if (solicited) { u32 hi_tid; struct ib_mad_agent_private *entry; - /* Routing is based on high 32 bits of transaction ID of MAD */ + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
+ */ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; - list_for_each_entry(entry, &port_priv->agent_list, agent_list) { + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { if (entry->agent.hi_tid == hi_tid) { mad_agent = entry; break; } } - if (!mad_agent) - printk(KERN_ERR PFX "No client 0x%x for received MAD " - "on port %d\n", - hi_tid, port_priv->port_num); } else { struct ib_mad_mgmt_class_table *version; struct ib_mad_mgmt_method_table *class; /* Routing is based on version, class, and method */ - if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { - printk(KERN_ERR PFX "MAD received with unsupported " - "class version %d on port %d\n", - mad->mad_hdr.class_version, port_priv->port_num); + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) goto out; - } version = port_priv->version[mad->mad_hdr.class_version]; if (!version) goto out; @@ -790,18 +789,19 @@ find_mad_agent(struct ib_mad_port_privat ~IB_MGMT_METHOD_RESP]; } -out: if (mad_agent) { if (mad_agent->agent.recv_handler) atomic_inc(&mad_agent->refcount); else { - mad_agent = NULL; - printk(KERN_ERR PFX "No receive handler for client " + printk(KERN_NOTICE PFX "No receive handler for client " "%p on port %d\n", &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; } - } - + } else + printk(KERN_NOTICE PFX "No client for received MAD on " + "port %d\n", port_priv->port_num); +out: spin_unlock_irqrestore(&port_priv->reg_lock, flags); return mad_agent; From roland at topspin.com Wed Nov 3 18:54:09 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 18:54:09 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: (Krishna Kumar's message of "Wed, 3 Nov 2004 13:23:32 -0800 (PST)") References: Message-ID: <52zn1ypbn2.fsf@topspin.com> Not sure what the goal is here, but I should point out that current mthca code does not implement resizing either CQs or QPs. However I'm not sure I understand why the MAD layer wants to resize these objects -- given that the number of QPs is known in advance and that the MAD layer can choose how many work requests to post per QP, I'm not sure what is gained by trying to resize things dynamically. - Roland From roland at topspin.com Wed Nov 3 18:59:06 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 18:59:06 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <20041103224337.GS17669@sventech.com> (Johannes Erdfelt's message of "Wed, 3 Nov 2004 14:43:37 -0800") References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> <20041103224337.GS17669@sventech.com> Message-ID: <52vfcmpbet.fsf@topspin.com> Johannes> Does the device name need to have the HCA driver name in Johannes> it? Also, the u in umad is implied. Good point, I'll change the docs to suggest no "u." Johannes> Wouldn't it be more appropriate to do something like Johannes> this: Johannes> /dev/infiniband/hca0/mad1 Maybe, but: - How does userspace know which device hca0 corresponds to? Right now, mthca0 can be looked up under /sys/class/infiniband. - Do we need to do switch0 etc. for switches? Of course the mthca driver could be updated to register itself using hcaN names instead of mthcaN names, which would solve things fairly transparently. Johannes> or maybe even: Johannes> /dev/ib/hca0/mad1 I don't like /dev/ib/ because I think "ib" is a little generic. So I prefer the more verbose but unambiguous "infiniband" name. 
- Roland From roland at topspin.com Wed Nov 3 20:13:35 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 20:13:35 -0800 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) In-Reply-To: <1099518873.2837.5.camel@hpc-1> (Hal Rosenstock's message of "Wed, 03 Nov 2004 16:54:34 -0500") References: <1099518873.2837.5.camel@hpc-1> Message-ID: <52ekjap7yo.fsf@topspin.com> Can you resend either with a different mailer or as an attachment? The patch was pretty line-wrapped. - R. From halr at voltaire.com Wed Nov 3 20:44:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 23:44:36 -0500 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) In-Reply-To: <52ekjap7yo.fsf@topspin.com> References: <1099518873.2837.5.camel@hpc-1> <52ekjap7yo.fsf@topspin.com> Message-ID: <1099543475.10754.3.camel@hpc-1> On Wed, 2004-11-03 at 23:13, Roland Dreier wrote: > Can you resend either with a different mailer or as an attachment? > The patch was pretty line-wrapped. Sorry. My mailer, Evolution 1.2.2-4, wraps lines in some cases. Not sure if this is fixed in newer versions or whether this is a configuration thing. Anyhow, let's try as an attachment for now. I'm sure you know this, but you will want to skip the changes to openib-candidate as they have already been applied. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: patch-plm Type: text/x-patch Size: 41063 bytes Desc: not available URL: From noohgnas at gmail.com Wed Nov 3 21:52:06 2004 From: noohgnas at gmail.com (Sang-Hoon,Lee) Date: Thu, 4 Nov 2004 14:52:06 +0900 Subject: [openib-general] question for ib_srp Message-ID: Hi all, I have a question about ib_srp usage: I couldn't get the ib_srp module to work. As far as I know, modprobe ib_srp is the common usage, and I ran it like this:
# modprobe ib_srp target_bindings="0002c90108a06551" srp_tracelevel=4 use_srp_indirect_addressing=1 ib_ports_mask=1
Then some errors were shown in /var/log/messages, below:
kernel: ib_srp: module license 'unspecified' taints kernel.
kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:206]max targets reported to scsi 64 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:207]max luns reported to scsi 256 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:209]max cmds(including aborts) per lun 32 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:211]max outstanding ios per target 259 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:226]Found HCA 0 ee743220 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:256]SRP Initiator GUID: 2c901081e67c0 for hca 1 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:309]Pool Create max pages 0x12 pool size 0x4000 kernel: [SRPTP][srp_dm_init][drivers/infiniband/ulp/srp/srp_dm.c:1614]Registering async events handler for HCA 0 kernel: Target Binding 0 to Target 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1131]Refreshing HCA/port info kernel: [SRPTP][srp_register_out_of_service][drivers/infiniband/ulp/srp/srp_dm.c:1333]Registering hca 1 local port 1 for IB out of service traps kernel: [SRPTP][srp_register_in_service][drivers/infiniband/ulp/srp/srp_dm.c:1283]Registering hca 1 local port 1 for IB in service traps kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_out_of_service_completion][drivers/infiniband/ulp/srp/srp_dm.c:1226]Out of service trap for hca 1 port 1 complete kernel: [SRPTP][srp_in_service_completion][drivers/infiniband/ulp/srp/srp_dm.c:1258]In service trap for hca 1 port 1 complete kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 0, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 0 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 2, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 2 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 3, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 3 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 4, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 4 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 5, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue 
back to scsi for target 5 [... the identical pair of messages, "Target N, no connection timeout" followed by "Flushing the pending queue back to scsi for target N", repeats here for each of targets 6 through 62 ...] kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 63, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to
scsi for target 63 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 4 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:584]Restarting DM Query on hca 1 port 1 timeout, retry count 1 kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:584]Restarting DM Query on hca 1 port 1 timeout, retry count 2 kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: 
[SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][srp_host_init][drivers/infiniband/ulp/srp/srp_host.c:1607]0 active connections 0 pending connections kernel: kernel: scsi3 : kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:584]Restarting DM Query on hca 1 port 1 timeout, retry count 3 kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][srp_host_abort_eh][drivers/infiniband/ulp/srp/srp_host.c:2522]Abort SCpnt ec912200 on target 1 kernel: [SRPTP][srp_host_device_reset_eh][drivers/infiniband/ulp/srp/srp_host.c:2697]Device reset...target 1 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 1 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:767] Sending IO back to scsi from pending list ioq eba4d290 kernel: bad: scheduling while atomic! 
kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] process_timeout+0x0/0x9 kernel: [] wake_up_process+0x1e/0x22 kernel: [] recalc_task_prio+0xb2/0x1ea kernel: [] __down+0x99/0x112 kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed+0x8/0xc kernel: [] .text.lock.scsi_error+0x23/0x46 kernel: [] scsi_eh_done+0x0/0x49 kernel: [] scsi_eh_times_out+0x0/0x1d kernel: [] scsi_eh_tur+0x93/0xc8 kernel: [] scsi_eh_bus_device_reset+0xc8/0xd2 kernel: [] scsi_eh_ready_devs+0x4f/0x93 kernel: [] scsi_unjam_host+0xc8/0xd1 kernel: [] scsi_error_handler+0xd1/0x10a kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_host_abort_eh][drivers/infiniband/ulp/srp/srp_host.c:2522]Abort SCpnt ec912200 on target 1 kernel: [SRPTP][srp_host_reset_eh][drivers/infiniband/ulp/srp/srp_host.c:2749]Host reset kernel: bad: scheduling while atomic! kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] __wake_up_common+0x31/0x50 kernel: [] __down+0x99/0x112 kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed+0x8/0xc kernel: [] .text.lock.scsi_error+0x37/0x46 kernel: [] scsi_sleep_done+0x0/0x11 kernel: [] scsi_try_host_reset+0x8f/0xc6 kernel: [] scsi_eh_host_reset+0x48/0xae kernel: [] scsi_eh_ready_devs+0x73/0x93 kernel: [] scsi_unjam_host+0xc8/0xd1 kernel: [] scsi_error_handler+0xd1/0x10a kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 3 times kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:594]DM Client timeout on hca 1 port 1, retry count 4 exceeded kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: bad: scheduling while atomic! 
kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] __wake_up_locked+0x22/0x26 kernel: [] __down+0x99/0x112 kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed+0x8/0xc kernel: [] .text.lock.scsi_error+0x23/0x46 kernel: [] scsi_eh_done+0x0/0x49 kernel: [] scsi_eh_times_out+0x0/0x1d kernel: [] scsi_eh_tur+0x93/0xc8 kernel: [] scsi_eh_host_reset+0x8d/0xae kernel: [] scsi_eh_ready_devs+0x73/0x93 kernel: [] scsi_unjam_host+0xc8/0xd1 kernel: [] scsi_error_handler+0xd1/0x10a kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 1, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 1 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:767] Sending IO back to scsi from pending list ioq eba4d290 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:767] Sending IO back to scsi from pending list ioq eba4de10 kernel: scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 1 lun 0 kernel: bad: scheduling while atomic! kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] as_next_request+0x33/0x3c kernel: [] elv_next_request+0x10/0xfc kernel: [] __down_interruptible+0xbd/0x14e kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed_interruptible+0x7/0xc kernel: [] .text.lock.scsi_error+0x41/0x46 kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: srp_host: target_bindings=2c90108a0655.1
The InfiniBand device drivers were loaded on both the initiator and the target machine. Is the problem within the initiator or the target? This is the list of modules loaded in common on the initiator and the target:
ib_dm_client 24764 0
ib_cm 52312 0
ib_useraccess 12484 0
ib_ipoib 66188 0
ib_sa_client 30216 3 ib_srp,ib_dm_client,ib_ipoib
ib_client_query 15392 4 ib_srp,ib_dm_client,ib_ipoib,ib_sa_client
ib_poll 17080 3 ib_dm_client,ib_cm,ib_client_query
ib_tavor 33284 5
mod_vapi 157688 1 ib_tavor
mod_vipkl 223932 1 mod_vapi
ib_mad 25100 4 ib_cm,ib_useraccess,ib_client_query,ib_tavor
mod_mpga 24576 1 mod_vapi
mod_thh 272160 1 mod_vapi
mod_vapi_common 87808 4 ib_tavor,mod_vapi,mod_vipkl,mod_thh
mosal 126792 5 mod_vapi,mod_vipkl,mod_mpga,mod_thh,mod_vapi_common
mod_hh 16696 2 mod_vipkl,mod_thh
ib_core 247316 8 ib_srp,ib_dm_client,ib_cm,ib_useraccess,ib_ipoib,ib_sa_client,ib_tavor,ib_mad
ib_services 17860 11 ib_srp,ib_dm_client,ib_cm,ib_useraccess,ib_ipoib,ib_sa_client,ib_client_query,ib_poll,ib_tavor,ib_mad,ib_core
Could you tell me what I should do to get ib_srp operating normally? /best regards
From mst at mellanox.co.il Thu Nov 4 02:41:20 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 12:41:20 +0200 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <52zn1ypbn2.fsf@topspin.com> References: <52zn1ypbn2.fsf@topspin.com> Message-ID: <20041104104120.GA2177@mellanox.co.il> If the max. number of QPs is very big, you may want the actual CQ size to grow gradually with demand. Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ": > Not sure what the goal is here, but I should point out that current > mthca code does not implement resizing either CQs or QPs.
> > However I'm not sure I understand why the MAD layer wants to resize > these objects -- given that the number of QPs is known in advance and > that the MAD layer can choose how many work requests to post per QP, > I'm not sure what is gained by trying to resize things dynamically. > > - Roland
From mst at mellanox.co.il Thu Nov 4 02:46:05 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 12:46:05 +0200 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <20041103224337.GS17669@sventech.com> References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> <20041103224337.GS17669@sventech.com> Message-ID: <20041104104605.GB2177@mellanox.co.il> Hello! Quoting r. Johannes Erdfelt (johannes at erdfelt.com) "Re: [openib-general] [PATCH] Initial checkin of userspace MAD access": > On Wed, Nov 03, 2004, Roland Dreier wrote: > > By the way, buried down at the end of the patch is some documentation > > about creating device files: > > > > +/dev files > > + > > + To create the appropriate character device files automatically with > > + udev, a rule like > > + > > + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" > > + > > + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 > > + for port 1 of device mthca0. > > > > Do the names /dev/infiniband/mthca0/umad1 and so on make sense to > > people? I thought that userspace verbs support would probably use a > > file like /dev/infiniband/mthca0/verbs, etc. > > > > In any case, now is probably the time to object before we have legacy > > issues to worry about.... > > Does the device name need to have the HCA driver name in it? Also, the u > in umad is implied. > > Wouldn't it be more appropriate to do something like this: > > /dev/infiniband/hca0/mad1 > > or maybe even: > > /dev/ib/hca0/mad1 > > JE I'd suggest /dev/ib/hca0/ports/1/mad. Then the user can give a file name like /dev/ib/hca0/ports/1 to opensm directly, opensm will just append "/mad", and it is also easier to find out how many ports are in hca0 without switching to sysfs. MST
From mst at mellanox.co.il Thu Nov 4 05:03:05 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 15:03:05 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded Message-ID: <20041104130305.GA2735@mellanox.co.il> Hello! I have uploaded an mstflint flash burning package to openib.org. You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ This is an update to the original flint utility that makes it possible to perform flash burning without loading special kernel-level drivers: it performs device PCI memory access by finding the device's physical address in /proc/bus/pci/devices and then accessing that memory through the standard /dev/mem file. There is also support for access through the configuration space, by writes to the special files in /proc/bus/pci. See https://openib.org/svn/trunk/contrib/mellanox/mstflint/README for installation details. Feedback welcome.
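[For illustration: the /dev/mem technique described above amounts to roughly the minimal C sketch below. This is not the actual mstflint code (that lives in mtcr.h and flint.cpp); the BAR physical address and map size here are made-up example values, and in the real tool the address is parsed out of /proc/bus/pci/devices.]

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Example values only; the real physical BAR address would be
         * parsed from /proc/bus/pci/devices for the HCA in question. */
        off_t bar_phys = 0xfe000000;
        size_t map_len = 0x100000;

        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) {
            perror("open /dev/mem");
            return 1;
        }

        /* Map the device's PCI memory BAR into this process. */
        volatile uint32_t *regs = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, bar_phys);
        if (regs == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        /* Device registers can now be read and written like memory. */
        printf("reg[0] = 0x%08x\n", regs[0]);

        munmap((void *) regs, map_len);
        close(fd);
        return 0;
    }

[The appeal of this approach is that no special kernel module is needed; the tradeoff is that it requires root and bypasses all kernel mediation of the device.]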
MST
From halr at voltaire.com Thu Nov 4 06:24:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 09:24:18 -0500 Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: References: Message-ID: <1099578258.15107.1.camel@hpc-1> On Wed, 2004-11-03 at 18:52, Krishna Kumar wrote: > Hi Hal, > > The same patch on latest bits .... Thanks. Applied. -- Hal
From halr at voltaire.com Thu Nov 4 06:51:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 09:51:58 -0500 Subject: [openib-general] [PATCH] Reorganize and clean up debug messages in find_mad_agent() In-Reply-To: References: Message-ID: <1099579918.2943.8.camel@hpc-1> On Wed, 2004-11-03 at 21:23, Krishna Kumar wrote: > Now messages are printed if either : > > 1. mad_agent is not found. > 2. mad_agent is found but doesn't have a handler. > > Printing messages during errors in the process of finding the > mad_agent has been removed. Thanks. Applied. -- Hal
From roland at topspin.com Thu Nov 4 07:08:04 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 07:08:04 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <20041104104120.GA2177@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 4 Nov 2004 12:41:20 +0200") References: <52zn1ypbn2.fsf@topspin.com> <20041104104120.GA2177@mellanox.co.il> Message-ID: <521xf9ps8b.fsf@topspin.com> Michael> If the max. number of QPs is very big, you may want the Michael> actual CQ size to grow gradually with demand. sure but there are only 2 special qps per port.
From halr at voltaire.com Thu Nov 4 07:15:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 10:15:11 -0500 Subject: [openib-general] [PATCH] agent: Change calling argument to agent_mad_send Message-ID: <1099581311.2943.16.camel@hpc-1> agent: Change calling argument to agent_mad_send Rather than taking a struct ib_mad_recv_wc *, take a struct ib_wc * Index: agent.c =================================================================== --- agent.c (revision 1131) +++ agent.c (working copy) @@ -296,7 +296,7 @@ static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_mad *mad, struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) + struct ib_wc *wc) { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; @@ -332,17 +332,17 @@ send_wr.opcode = IB_WR_SEND; send_wr.sg_list = &gather_list; send_wr.num_sge = 1; - send_wr.wr.ud.remote_qpn = mad_recv_wc->wc->src_qp; /* DQPN */ + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ send_wr.wr.ud.timeout_ms = 0; send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - ah_attr.dlid = mad_recv_wc->wc->slid; + ah_attr.dlid = wc->slid; ah_attr.port_num = mad_agent->port_num; - ah_attr.src_path_bits = mad_recv_wc->wc->dlid_path_bits; - ah_attr.sl = mad_recv_wc->wc->sl; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; ah_attr.static_rate = 0; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - if (mad_recv_wc->wc->wc_flags & IB_WC_GRH) { + if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; ah_attr.grh.sgid_index = 0; /* Should sgid be looked up ?
*/ @@ -351,7 +351,7 @@ ah_attr.grh.traffic_class = (be32_to_cpup(&grh->version_tclass_flow) >> 20) & 0xff; memcpy(ah_attr.grh.dgid.raw, grh->sgid.raw, sizeof(struct ib_grh)); } else { - ah_attr.ah_flags = 0; /* No GRH */ + ah_attr.ah_flags = 0; /* No GRH for SM class */ } } else { /* Directed route or LID routed SM class */ @@ -367,7 +367,7 @@ send_wr.wr.ud.ah = agent_send_wr->ah; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - send_wr.wr.ud.pkey_index = mad_recv_wc->wc->pkey_index; + send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ @@ -407,7 +407,6 @@ { struct ib_agent_port_private *port_priv; struct ib_mad_agent *mad_agent; - struct ib_mad_recv_wc mad_recv_wc; port_priv = ib_get_agent_mad(device, port_num, NULL); if (!port_priv) { @@ -431,9 +430,7 @@ return 1; } - /* Other fields don't matter so should change signature to just use wc */ - mad_recv_wc.wc = wc; - return agent_mad_send(mad_agent, mad, grh, &mad_recv_wc); + return agent_mad_send(mad_agent, mad, grh, wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent,
From halr at voltaire.com Thu Nov 4 07:40:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 10:40:47 -0500 Subject: [openib-general] [PATCH] agent: Minor modifications to smi_check_local_xxx routines Message-ID: <1099582847.2837.2.camel@hpc-1> agent: Minor modifications to smi_check_local_xxx routines Index: agent.c =================================================================== --- agent.c (revision 1133) +++ agent.c (working copy) @@ -117,8 +117,7 @@ { /* C14-9:3 -- We're at the end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ - return ((smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) || - (mad_agent->device->process_mad && + return ((mad_agent->device->process_mad && !ib_get_smp_direction(smp) && (smp->hop_ptr == smp->hop_cnt + 1))); } @@ -283,6 +282,8 @@ { struct ib_agent_port_private *port_priv; + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; port_priv = ib_get_agent_mad(device, port_num, NULL); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n",
From mst at mellanox.co.il Thu Nov 4 07:44:01 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 17:44:01 +0200 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <521xf9ps8b.fsf@topspin.com> References: <52zn1ypbn2.fsf@topspin.com> <20041104104120.GA2177@mellanox.co.il> <521xf9ps8b.fsf@topspin.com> Message-ID: <20041104154400.GB3499@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ": > Michael> If the max. number of QPs is very big, you may want the > Michael> actual CQ size to grow gradually with demand. > > sure but there are only 2 special qps per port. Of course, it is only relevant for regular qps. mst
From rminnich at lanl.gov Thu Nov 4 07:57:30 2004 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Thu, 4 Nov 2004 08:57:30 -0700 (MST) Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: <20041104130305.GA2735@mellanox.co.il> References: <20041104130305.GA2735@mellanox.co.il> Message-ID: On Thu, 4 Nov 2004, Michael S. Tsirkin wrote: > I have uploaded an mstflint flash burning package to openib.org.
> You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ neat. How does this differ from tvflash that Roland wrote? thanks ron From halr at voltaire.com Thu Nov 4 08:04:27 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 11:04:27 -0500 Subject: [openib-general] SM and smi In-Reply-To: <52lldltdve.fsf@topspin.com> References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> Message-ID: <1099584267.2837.14.camel@hpc-1> On Mon, 2004-11-01 at 17:15, Roland Dreier wrote: > I think SMI processing should be applied to all DR SMPs passed to > ib_post_send_mad(). This requires the MAD layer to peek into the outgoing MAD but all it has is the DMA address in the sg-list and the last time I tried to do this (go from DMA address to a VA), the approach used to do this was in the process of being deprecated. Last time, I think the need to do this was obviated. Is there an acceptable alternative ? > This is what the Topspin stack does and I believe it is what OpenSM expects. I think we could change what OpenSM expected if needed so this does not appear to me to be a determining factor. -- Hal From roland at topspin.com Thu Nov 4 08:14:28 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 08:14:28 -0800 Subject: [openib-general] SM and smi In-Reply-To: <1099584267.2837.14.camel@hpc-1> (Hal Rosenstock's message of "Thu, 04 Nov 2004 11:04:27 -0500") References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> <1099584267.2837.14.camel@hpc-1> Message-ID: <52wtx1oal7.fsf@topspin.com> Hal> This requires the MAD layer to peek into the outgoing MAD but Hal> all it has is the DMA address in the sg-list and the last Hal> time I tried to do this (go from DMA address to a VA), the Hal> approach used to do this was in the process of being Hal> deprecated. Last time, I think the need to do this was Hal> obviated. Is there an acceptable alternative ? I thought the wr.ud.mad_hdr member was added for this sort of thing... - R. From roland at topspin.com Thu Nov 4 08:15:21 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 08:15:21 -0800 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: (Ronald G. Minnich's message of "Thu, 4 Nov 2004 08:57:30 -0700 (MST)") References: <20041104130305.GA2735@mellanox.co.il> Message-ID: <52sm7poajq.fsf@topspin.com> Ronald> neat. How does this differ from tvflash that Roland wrote? correction: I just cleaned up the code. Kamen and Johannes here at Topspin did most of the real work in writing tvflash... - R From roland at topspin.com Thu Nov 4 08:15:54 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 08:15:54 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <20041104104605.GB2177@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 4 Nov 2004 12:46:05 +0200") References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> <20041103224337.GS17669@sventech.com> <20041104104605.GB2177@mellanox.co.il> Message-ID: <52oeidoait.fsf@topspin.com> Michael> /dev/ib/hca0/ports/1/mad I like this idea. Thanks, Roland From mst at mellanox.co.il Thu Nov 4 08:17:25 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 18:17:25 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: References: <20041104130305.GA2735@mellanox.co.il> Message-ID: <20041104161725.GB2550@mellanox.co.il> Hello! Quoting r. Ronald G. 
Minnich (rminnich at lanl.gov) "Re: [openib-general] announcement: mstflint flash burning package uploaded": > > > On Thu, 4 Nov 2004, Michael S. Tsirkin wrote: > > > I have uploaded an mstflint flash burning package to openib.org. > > You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > neat. How does this differ from tvflash that Roland wrote? > > thanks > > ron It supports a wider range of cards produced by Mellanox, supports integration with the IB management tools that we are developing, and is based on the flint code that we use in our production environment. MST
From mst at mellanox.co.il Thu Nov 4 08:23:33 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 18:23:33 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: <52sm7poajq.fsf@topspin.com> References: <20041104130305.GA2735@mellanox.co.il> <52sm7poajq.fsf@topspin.com> Message-ID: <20041104162333.GE2550@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] announcement: mstflint flash burning package uploaded": > Ronald> neat. How does this differ from tvflash that Roland wrote? > > correction: I just cleaned up the code. Something mstflint could benefit from, too :) MST
From mst at mellanox.co.il Thu Nov 4 08:25:47 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 18:25:47 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: References: <20041104130305.GA2735@mellanox.co.il> Message-ID: <20041104162547.GF2550@mellanox.co.il> Hello! Quoting r. Ronald G. Minnich (rminnich at lanl.gov) "Re: [openib-general] announcement: mstflint flash burning package uploaded": > > > On Thu, 4 Nov 2004, Michael S. Tsirkin wrote: > > > I have uploaded an mstflint flash burning package to openib.org. > > You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > neat. How does this differ from tvflash that Roland wrote? Clarification: it's basically (a later revision of) the same flint utility you already have. All I did was replace the calls to the kernel driver with access to /dev/mem and friends. My code is in mtcr.h; flint.cpp is taken from production flint as is. MST
From roland at topspin.com Thu Nov 4 09:37:18 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 09:37:18 -0800 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) In-Reply-To: <1099543475.10754.3.camel@hpc-1> (Hal Rosenstock's message of "Wed, 03 Nov 2004 23:44:36 -0500") References: <1099518873.2837.5.camel@hpc-1> <52ekjap7yo.fsf@topspin.com> <1099543475.10754.3.camel@hpc-1> Message-ID: <527jp1o6r5.fsf@topspin.com> OK, I merged the MAD code in my branch up to r1135 and applied this patch (there was one missing chunk in ib_verbs.h to remove the snoop_mad method from struct ib_device, which I added by hand). Thanks, Roland
From roland at topspin.com Thu Nov 4 09:46:00 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 09:46:00 -0800 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk Message-ID: <52y8hhmrs7.fsf@topspin.com> I have just copied the roland-merge branch to https://openib.org/svn/gen2/trunk This tree will become the main development tree and will be used to create the tree we will submit to the kernel for inclusion. Please use this tree for testing and as the base for all patches.
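[As a usage note: a fresh working copy of the new tree can be obtained with a command along the lines of "svn checkout https://openib.org/svn/gen2/trunk", checking out into a directory of your choice.]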
I will be cleaning up this tree (mostly deleting code that does not build any more, etc) over the next few days. Thanks, Roland
From krkumar at us.ibm.com Thu Nov 4 09:44:28 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Thu, 4 Nov 2004 09:44:28 -0800 (PST) Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: Message-ID: On Thu, 4 Nov 2004, Roland Dreier wrote: > Not sure what the goal is here, but I should point out that current > mthca code does not implement resizing either CQs or QPs. Yes, I agree on that. In fact, the verbs layer will return ENOSYS for the mthca driver. But I was assuming that any other driver from a different hardware vendor could support this call (mthca could support it over time too?). > However I'm not sure I understand why the MAD layer wants to resize > these objects -- given that the number of QPs is known in advance and > that the MAD layer can choose how many work requests to post per QP, > I'm not sure what is gained by trying to resize things dynamically. Actually, I haven't really implemented the "dynamically" part, where you resize the CQ during operation. The spec says that when you create a QP, it can be larger than what you specified. If so, I see good value in increasing the size of the associated CQ, if that is supported by the driver. Thanks, - KK
From xma at us.ibm.com Thu Nov 4 10:14:49 2004 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 4 Nov 2004 10:14:49 -0800 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <52y8hhmrs7.fsf@topspin.com> Message-ID: So everybody should start working on this tree. What's the difference between openib-candidate and trunk under gen2? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
From halr at voltaire.com Thu Nov 4 10:24:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 13:24:47 -0500 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: References: Message-ID: <1099592687.3890.2.camel@perr-t30.us.voltaire.com> On Thu, 2004-11-04 at 13:14, Shirley Ma wrote: > So everybody should start working on this tree.
What's the difference > between openib-candidate and trunk under gen2? trunk is much more complete with IPoIB. This is what is heading towards being pushed to the 2.6 kernel. Yes, you should use this tree :-) -- Hal
From mashirle at us.ibm.com Thu Nov 4 12:44:03 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Thu, 4 Nov 2004 12:44:03 -0800 Subject: [openib-general] [PATCH]fix memory leak associated with agent_send_handler() in gen2/trunk Message-ID: <200411041244.03710.mashirle@us.ibm.com> Please review this patch. diff -urN infiniband/core/agent.c infiniband.patch/core/agent.c --- infiniband/core/agent.c 2004-11-04 10:35:20.000000000 -0800 +++ infiniband.patch/core/agent.c 2004-11-04 12:35:55.916027072 -0800 @@ -480,6 +480,7 @@ /* Release allocated memory */ kfree(agent_send_wr->mad); + kfree(agent_send_wr); } int ib_agent_port_open(struct ib_device *device, int port_num, -- Thanks Shirley Ma IBM Linux Technology Center
From halr at voltaire.com Thu Nov 4 13:02:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 16:02:39 -0500 Subject: [openib-general] [PATCH]fix memory leak associated with agent_send_handler() in gen2/trunk In-Reply-To: <200411041244.03710.mashirle@us.ibm.com> References: <200411041244.03710.mashirle@us.ibm.com> Message-ID: <1099602159.3110.3.camel@hpc-1> On Thu, 2004-11-04 at 15:44, Shirley Ma wrote: > Please review this patch. Thanks. Applied. -- Hal
From halr at voltaire.com Thu Nov 4 13:34:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 16:34:06 -0500 Subject: [openib-general] [PATCH] mad: Remove print of "No client for received MAD" Message-ID: <1099604046.3110.6.camel@hpc-1> mad: Removed print of "No client for received MAD" as this can be a normal case Index: mad.c =================================================================== --- mad.c (revision 1139) +++ mad.c (working copy) @@ -798,9 +798,7 @@ &mad_agent->agent, port_priv->port_num); mad_agent = NULL; } - } else - printk(KERN_NOTICE PFX "No client for received MAD on " - "port %d\n", port_priv->port_num); + } out: spin_unlock_irqrestore(&port_priv->reg_lock, flags);
From krkumar at us.ibm.com Thu Nov 4 13:31:28 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Thu, 4 Nov 2004 13:31:28 -0800 (PST) Subject: [openib-general] [PATCH] fix memory leak in ib_mad_recv_done_handler Message-ID: Also, updated a comment so that it is known that the recv_handler (when it is implemented) is in charge of freeing up recv during its processing. Applies to gen2/trunk. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-04 10:38:30.000000000 -0800 +++ 2/mad.c 2004-11-04 13:26:39.000000000 -0800 @@ -1045,14 +1045,16 @@ static void ib_mad_recv_done_handler(str solicited); if (mad_agent) { ib_mad_complete_recv(mad_agent, recv, solicited); - recv = NULL; /* recv is freed up via ib_mad_complete_recv */ + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv(). + */ + recv = NULL; } out: - if (recv) { - /* Should this case be optimized ?
*/ - kmem_cache_free(ib_mad_cache, recv); + ib_free_recv_mad(&recv->header.recv_wc); /* Post another receive request for this QP */ ib_mad_post_receive_mad(qp_info);
From halr at voltaire.com Thu Nov 4 14:17:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 17:17:32 -0500 Subject: [openib-general] [PATCH] fix memory leak in ib_mad_recv_done_handler In-Reply-To: References: Message-ID: <1099606652.2834.0.camel@hpc-1> On Thu, 2004-11-04 at 16:31, Krishna Kumar wrote: > Also, updated a comment so that it is known that the recv_handler > (when it is implemented) is in charge of freeing up recv during its > processing. Applies to gen2/trunk. Thanks. Applied the commentary part of the change. The memory leak "fix" needs some work, as the MAD layer now oopses on a NULL pointer dereference at virtual address 0. I omitted this one line change (for now): - kmem_cache_free(ib_mad_cache, recv); + ib_free_recv_mad(&recv->header.recv_wc); -- Hal
From mshefty at ichips.intel.com Thu Nov 4 14:12:15 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 04 Nov 2004 14:12:15 -0800 Subject: [openib-general] Reusing receive MADs Message-ID: <418AA93F.1060602@ichips.intel.com> Is there any interest among people to reuse receive MADs? I.e. once allocated and mapped, the receive MAD and work request would be re-posted to the QP when freed. I ask because if people are interested in such an optimization at some point in the future, it will affect how I structure send queue overrun handling. - Sean
From roland at topspin.com Thu Nov 4 14:20:01 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 14:20:01 -0800 Subject: [openib-general] Reusing receive MADs In-Reply-To: <418AA93F.1060602@ichips.intel.com> (Sean Hefty's message of "Thu, 04 Nov 2004 14:12:15 -0800") References: <418AA93F.1060602@ichips.intel.com> Message-ID: <52u0s5l0j2.fsf@topspin.com> Sean> Is there any interest among people to reuse receive MADs? Sean> I.e. once allocated and mapped, the receive MAD and work Sean> request would be re-posted to the QP when freed. I'm not sure this is that useful... MAD processing is not such a super-hot path that we need to keep per-CPU lists of cache-hot buffers (as is done for sk_buffs), and the kernel slab code should do a pretty good job of reusing buffers anyway. (The receive buffer needs to be unmapped before passing to the consumer anyway so there's not a saving there) - R.
From halr at voltaire.com Fri Nov 5 05:31:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 08:31:24 -0500 Subject: [openib-general] [PATCH] mad: Restructure smi as shared between mad and agent Message-ID: <1099661482.14234.3.camel@hpc-1> mad: Restructure smi as shared between mad and agent (adds new files smi.h, smi.c, and agent.h) Index: agent.c =================================================================== --- agent.c (revision 1161) +++ agent.c (working copy) @@ -24,6 +24,7 @@ */ #include +#include "smi.h" #include "agent_priv.h" #include "mad_priv.h" #include @@ -32,210 +33,7 @@ static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); -/* - * Fixup a directed route SMP for sending. Return 0 if the SMP should be - * discarded.
- */ -int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) -{ - u8 hop_ptr, hop_cnt; - hop_ptr = smp->hop_ptr; - hop_cnt = smp->hop_cnt; - - /* See section 14.2.2.2, Vol 1 IB spec */ - if (!ib_get_smp_direction(smp)) { - /* C14-9:1 */ - if (hop_cnt && hop_ptr == 0) { - smp->hop_ptr++; - return (smp->initial_path[smp->hop_ptr] == - port_num); - } - - /* C14-9:2 */ - if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - /* smp->return_path set when received */ - smp->hop_ptr++; - return (smp->initial_path[smp->hop_ptr] == - port_num); - } - - /* C14-9:3 -- We're at the end of the DR segment of path */ - if (hop_ptr == hop_cnt) { - /* smp->return_path set when received */ - smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); - } - - /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ - /* C14-9:5 -- Fail unreasonable hop pointer. */ - return (hop_ptr == hop_cnt + 1); - - } else { - /* C14-13:1 */ - if (hop_cnt && hop_ptr == hop_cnt + 1) { - smp->hop_ptr--; - return (smp->return_path[smp->hop_ptr] == - port_num); - } - - /* C14-13:2 */ - if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - smp->hop_ptr--; - return (smp->return_path[smp->hop_ptr] == - port_num); - } - - /* C14-13:3 -- at the end of the DR segment of path */ - if (hop_ptr == 1) { - smp->hop_ptr--; - /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || - smp->dr_slid == IB_LID_PERMISSIVE); - } - - /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ - /* C14-13:5 -- Check for unreasonable hop pointer. */ - return 0; - } -} - -/* - * Return 1 if the SMP should be handled by the local SMA via process_mad. - */ -static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp) -{ - /* C14-9:3 -- We're at the end of the DR segment of path */ - /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ - return ((mad_agent->device->process_mad && - !ib_get_smp_direction(smp) && - (smp->hop_ptr == smp->hop_cnt + 1))); -} - -/* - * Adjust information for a received SMP. Return 0 if the SMP should be - * dropped. - */ -int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) -{ - u8 hop_ptr, hop_cnt; - - hop_ptr = smp->hop_ptr; - hop_cnt = smp->hop_cnt; - - /* See section 14.2.2.2, Vol 1 IB spec */ - if (!ib_get_smp_direction(smp)) { - /* C14-9:1 -- sender should have incremented hop_ptr */ - if (hop_cnt && hop_ptr == 0) - return 0; - - /* C14-9:2 -- intermediate hop */ - if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - smp->return_path[hop_ptr] = port_num; - /* smp->hop_ptr updated when sending */ - return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); - } - - /* C14-9:3 -- We're at the end of the DR segment of path */ - if (hop_ptr == hop_cnt) { - if (hop_cnt) - smp->return_path[hop_ptr] = port_num; - /* smp->hop_ptr updated when sending */ - - return (node_type == IB_NODE_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); - } - - /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ - /* C14-9:5 -- fail unreasonable hop pointer. 
*/ - return (hop_ptr == hop_cnt + 1); - - } else { - - /* C14-13:1 */ - if (hop_cnt && hop_ptr == hop_cnt + 1) { - smp->hop_ptr--; - return (smp->return_path[smp->hop_ptr] == - port_num); - } - - /* C14-13:2 */ - if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - /* smp->hop_ptr updated when sending */ - return (smp->return_path[hop_ptr-1] <= phys_port_cnt); - } - - /* C14-13:3 -- We're at the end of the DR segment of path */ - if (hop_ptr == 1) { - if (smp->dr_slid == IB_LID_PERMISSIVE) { - /* giving SMP to SM - update hop_ptr */ - smp->hop_ptr--; - return 1; - } - /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH); - } - - /* C14-13:4 -- hop_ptr = 0 -> give to SM. */ - /* C14-13:5 -- Check for unreasonable hop pointer. */ - return (hop_ptr == 0); - } -} - -/* - * Return 1 if the received DR SMP should be forwarded to the send queue. - * Return 0 if the SMP should be completed up the stack. - */ -int smi_check_forward_dr_smp(struct ib_smp *smp) -{ - u8 hop_ptr, hop_cnt; - - hop_ptr = smp->hop_ptr; - hop_cnt = smp->hop_cnt; - - if (!ib_get_smp_direction(smp)) { - /* C14-9:2 -- intermediate hop */ - if (hop_ptr && hop_ptr < hop_cnt) - return 1; - - /* C14-9:3 -- at the end of the DR segment of path */ - if (hop_ptr == hop_cnt) - return (smp->dr_dlid == IB_LID_PERMISSIVE); - - /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ - if (hop_ptr == hop_cnt + 1) - return 1; - } else { - /* C14-13:2 */ - if (2 <= hop_ptr && hop_ptr <= hop_cnt) - return 1; - - /* C14-13:3 -- at the end of the DR segment of path */ - if (hop_ptr == 1) - return (smp->dr_slid != IB_LID_PERMISSIVE); - } - return 0; -} - static inline struct ib_agent_port_private * __ib_get_agent_mad(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) Index: mad.c =================================================================== --- mad.c (revision 1161) +++ mad.c (working copy) @@ -56,6 +56,8 @@ #include #include "mad_priv.h" +#include "smi.h" +#include "agent.h" #include #include @@ -922,23 +924,6 @@ } } -extern int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt); -extern int smi_check_forward_dr_smp(struct ib_smp *smp); -extern int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num); -extern int smi_check_local_dr_smp(struct ib_smp *smp, - struct ib_device *device, - int port_num); -extern int agent_send(struct ib_mad *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num); - static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -1877,12 +1862,6 @@ return 0; } - -extern int ib_agent_port_open(struct ib_device *device, int port_num, - int phys_port_cnt); -extern int ib_agent_port_close(struct ib_device *device, int port_num); - - static void ib_mad_init_device(struct ib_device *device) { int ret, num_ports, cur_port, i, ret2; Index: agent.h =================================================================== --- agent.h (revision 0) +++ agent.h (revision 0) @@ -0,0 +1,41 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . 
+ + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern int ib_agent_port_open(struct ib_device *device, + int port_num, + int phys_port_cnt); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ Index: smi.c =================================================================== --- smi.c (revision 0) +++ smi.c (revision 0) @@ -0,0 +1,219 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + + +/* + * Fixup a directed route SMP for sending. Return 0 if the SMP should be + * discarded. + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + /* C14-9:5 -- Fail unreasonable hop pointer. 
*/ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ + /* C14-13:5 -- Check for unreasonable hop pointer. */ + return 0; + } +} + +/* + * Adjust information for a received SMP. Return 0 if the SMP should be + * dropped. + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + /* C14-9:5 -- fail unreasonable hop pointer. */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM. */ + /* C14-13:5 -- Check for unreasonable hop pointer. */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue. + * Return 0 if the SMP should be completed up the stack. + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. 
*/ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + Index: Makefile =================================================================== --- Makefile (revision 1161) +++ Makefile (working copy) @@ -17,6 +17,7 @@ ib_mad-objs := \ mad.o \ + smi.o \ agent.o ib_sa-objs := sa_query.o Index: smi.h =================================================================== --- smi.h (revision 0) +++ smi.h (revision 0) @@ -0,0 +1,54 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA via process_mad. + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From halr at voltaire.com Fri Nov 5 10:49:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 13:49:39 -0500 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad Message-ID: <1099680579.2965.7.camel@hpc-1> mad: Handle outgoing SMPs in ib_post_send_mad The MAD layer is now ready to support the SM :-) I have not yet handled the additional special cases: hop count increment done by device, use send queue rather than process MAD for 0 hop SMPs). 
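For reference, every C14-9/C14-13 case split in the SMI code above keys
off of the SMP direction bit. That test is just the D bit, the top bit of
the big-endian status word of a directed route SMP (IB spec 14.2.2). A
minimal sketch of the accessor the code assumes (sketch only; the
authoritative definition lives in the ib_smi.h header, not here):

/* D bit of a directed route SMP: 0 = outgoing (initial path),
 * 1 = returning.  Sketch of the accessor assumed by smi.c above.
 */
static inline int ib_get_smp_direction(struct ib_smp *smp)
{
	return ((smp->status & cpu_to_be16(0x8000)) == cpu_to_be16(0x8000));
}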
Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1161) +++ mad_priv.h (working copy) @@ -115,6 +115,7 @@ atomic_t refcount; wait_queue_head_t wait; + int phys_port_cnt; u8 rmpp_version; }; Index: mad.c =================================================================== --- mad.c (revision 1162) +++ mad.c (working copy) @@ -89,6 +89,7 @@ static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); +static int solicited_mad(struct ib_mad *mad); /* * Returns a ib_mad_port_private structure or NULL for a device/port. @@ -243,6 +244,7 @@ mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->phys_port_cnt = port_priv->phys_port_cnt; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -368,6 +370,105 @@ spin_unlock_irqrestore(&mad_queue->lock, flags); } +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret; + + if (!smi_handle_dr_smp_send(smp, + mad_agent->device->node_type, + mad_agent->port_num)) { + ret = -EINVAL; + printk(KERN_ERR "Invalid directed route\n"); + goto error1; + } + if (smi_check_local_dr_smp(smp, + mad_agent->device, + mad_agent->port_num)) { + struct ib_mad_private *mad_priv; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wc mad_send_wc; + + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto error1; + } + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + ret = mad_agent->device->process_mad(mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? */ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (!smi_handle_dr_smp_recv((struct ib_smp *)&mad_priv->mad, + mad_agent->device->node_type, + mad_agent->port_num, + mad_agent_priv->phys_port_cnt)) { + ret = -EINVAL; + kmem_cache_free(ib_mad_cache, mad_priv); + goto error1; + } + } + + /* See if response is solicited and there is a recv handler */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; + + /* Defined behavior is to complete response before request */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler(mad_agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + + if (mad_agent_priv->agent.send_handler) { + /* Now, complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent_priv->agent.send_handler(mad_agent, &mad_send_wc); + ret = 1; + } else + ret = -EINVAL; + } else + ret = 0; + +error1: + return ret; +} + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr, struct ib_send_wr *send_wr, @@ -422,9 +523,27 @@ while (cur_send_wr) { unsigned long flags; struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + if (!cur_send_wr->wr.ud.mad_hdr) { + *bad_send_wr = cur_send_wr; + printk(KERN_ERR PFX "MAD header must be supplied in WR %p\n", cur_send_wr); + goto error1; + } + next_send_wr = (struct ib_send_wr *)cur_send_wr->next; + smp = (struct ib_smp *)cur_send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent, smp, cur_send_wr); + if (ret < 0) { /* error */ + *bad_send_wr = cur_send_wr; + goto error1; + } else if (ret == 1) { /* locally consumed */ + goto next; + } + } + /* Allocate MAD send WR tracking structure */ mad_send_wr = kmalloc(sizeof *mad_send_wr, (in_atomic() || irqs_disabled()) ? @@ -467,7 +586,8 @@ atomic_dec(&mad_agent_priv->refcount); return ret; } - cur_send_wr= next_send_wr; +next: + cur_send_wr = next_send_wr; } return 0; From halr at voltaire.com Fri Nov 5 10:54:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 13:54:47 -0500 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <1099680887.2965.18.camel@hpc-1> On Thu, 2004-11-04 at 12:44, Krishna Kumar wrote: > On Thu, 4 Nov 2004, Roland Dreier wrote: > > > Not sure what the goal is here, but I should point out that current > > mthca code does not implement resizing either CQs or QPs. > > Yes, I agree on that. Infact the verbs layer will return ENOSYS for > mthca driver. But I was assuming that any other driver by a different > hardware vendor can support this call (mthca over time could support > this call too ?). Is this a driver or firmware issue ? > > However I'm not sure I understand why the MAD layer wants to resize > > these objects -- given that the number of QPs is known in advance and > > that the MAD layer can choose how many work requests to post per QP, > > I'm not sure what is gained by trying to resize things dynamically. > > Actually, I haven't really implemented the "dynamically" part, where you > resize the CQ during operation. The spec said that when you create a QP, > it can be larger than what you specified. If so, I see good value in > increasing the size of the associated CQ, if it is supported by the > driver. Might this be useful for redirected QPs ? Should the incorporation of this functionality be deferred until either there is hardware which supports this or we find some use for it ? 
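(For concreteness, the kind of use being weighed here, as an untested
sketch: the resize verb's name and signature are assumed from the RFC
patches, and the throttling fallback is only a comment, not code from the
tree.)

/* Grow the CQ to cover the actual (possibly rounded-up) QP depths.
 * If the device cannot resize -- mthca returns -ENOSYS today -- the
 * caller must instead limit posted WRs to the original CQ size.
 */
static int grow_cq_for_qp(struct ib_cq *cq, int actual_send_wr,
			  int actual_recv_wr)
{
	int ret = ib_resize_cq(cq, actual_send_wr + actual_recv_wr);

	if (ret == -ENOSYS)
		ret = 0;	/* fall back to throttling posted WRs */
	return ret;
}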
-- Hal From halr at voltaire.com Fri Nov 5 10:56:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 13:56:52 -0500 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <52y8hhmrs7.fsf@topspin.com> References: <52y8hhmrs7.fsf@topspin.com> Message-ID: <1099681012.2965.22.camel@hpc-1> On Thu, 2004-11-04 at 12:46, Roland Dreier wrote: > I have just copied the roland-merge branch to > > https://openib.org/svn/gen2/trunk > > This tree will become the main development tree and will be used to > create the tree we will submit to the kernel for inclusion. Please > use this tree for testing and as the base for all patches. > > I will be cleaning up this tree (mostly deleting code that does not > build any more, etc) over the next few days. This looks great. I have just 2 minor questions: 1. Are there changes planned for core/cache.c ? 2. Shouldn't src/userspace/tools/libsdp be removed for now ? Thanks. -- Hal From roland at topspin.com Fri Nov 5 11:20:21 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 11:20:21 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <1099680887.2965.18.camel@hpc-1> (Hal Rosenstock's message of "Fri, 05 Nov 2004 13:54:47 -0500") References: <1099680887.2965.18.camel@hpc-1> Message-ID: <52is8khzm2.fsf@topspin.com> Hal> Is this a driver or firmware issue ? Driver issue. I just haven't implemented CQ resize yet, and it's not a high priority for me. Hal> Might this be useful for redirected QPs ? I don't think so, since the redirected QP will not be attached to the MAD layer's CQ. Hal> Should the incorporation of this functionality be deferred Hal> until either there is hardware which supports this or we find Hal> some use for it ? I think so. - R. From roland at topspin.com Fri Nov 5 11:21:12 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 11:21:12 -0800 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <1099681012.2965.22.camel@hpc-1> (Hal Rosenstock's message of "Fri, 05 Nov 2004 13:56:52 -0500") References: <52y8hhmrs7.fsf@topspin.com> <1099681012.2965.22.camel@hpc-1> Message-ID: <52ekj8hzkn.fsf@topspin.com> Hal> 1. Are there changes planned for core/cache.c ? I've cleaned it up a little but I'm really not sure exactly what should be done with it. Hal> 2. Shouldn't src/userspace/tools/libsdp be removed for now ? Yeah, I'll do that. -R. From mshefty at ichips.intel.com Fri Nov 5 11:21:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 05 Nov 2004 11:21:17 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <1099680887.2965.18.camel@hpc-1> References: <1099680887.2965.18.camel@hpc-1> Message-ID: <418BD2AD.6030807@ichips.intel.com> Hal Rosenstock wrote: >>>However I'm not sure I understand why the MAD layer wants to resize >>>these objects -- given that the number of QPs is known in advance and >>>that the MAD layer can choose how many work requests to post per QP, >>>I'm not sure what is gained by trying to resize things dynamically. >> >>Actually, I haven't really implemented the "dynamically" part, where you >>resize the CQ during operation. The spec said that when you create a QP, >>it can be larger than what you specified. If so, I see good value in >>increasing the size of the associated CQ, if it is supported by the >>driver. > > > Might this be useful for redirected QPs ? 
Since the client allocates the QP and CQ in this case, they would be
responsible for resizing the CQ appropriately. The MAD layer could
provide queuing to prevent send queue overflow, or not, depending on how
we want to implement it.

> Should the incorporation of this functionality be deferred until either
> there is hardware which supports this or we find some use for it ?

I think we should go ahead and put this code in. We need to handle the
case where the QP is sized larger than what we request anyway, to ensure
that we don't overrun the CQ.

From krkumar at us.ibm.com  Fri Nov  5 11:57:53 2004
From: krkumar at us.ibm.com (Krishna Kumar)
Date: Fri, 5 Nov 2004 11:57:53 -0800 (PST)
Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query.
Message-ID: 

Current code frees up memory in error case and dereferences it later,
plus the success case doesn't (seem to) free it up.

(do you guys need patches to be rooted from a particular directory
to be more efficient/convenient ?)

- KK

diff -ruNp 7/sa_query.c 8/sa_query.c
--- 7/sa_query.c	2004-11-05 11:37:44.000000000 -0800
+++ 8/sa_query.c	2004-11-05 11:51:06.000000000 -0800
@@ -544,12 +544,14 @@ int ib_sa_path_rec_get(struct ib_device
 		 rec, query->sa_query.mad->data);
 
 	ret = send_mad(&query->sa_query, timeout_ms);
-	if (ret)
-		kfree(query);
-
-	*sa_query = &query->sa_query;
-
-	return ret ? ret : query->sa_query.id;
+	if (!ret) {
+		/* Success, return the SA Query and ID. */
+		ret = query->sa_query.id;
+		*sa_query = &query->sa_query;
+	}
+	kfree(query);
+	return ret;
 }
 EXPORT_SYMBOL(ib_sa_path_rec_get);
 
@@ -617,12 +619,14 @@ int ib_sa_mcmember_rec_query(struct ib_d
 		 rec, query->sa_query.mad->data);
 
 	ret = send_mad(&query->sa_query, timeout_ms);
-	if (ret)
-		kfree(query);
-
-	*sa_query = &query->sa_query;
+	if (!ret) {
+		/* Success, return the SA Query and ID. */
+		ret = query->sa_query.id;
+		*sa_query = &query->sa_query;
+	}
+	kfree(query);
+	return ret;
 
-	return ret ? ret : query->sa_query.id;
 }
 EXPORT_SYMBOL(ib_sa_mcmember_rec_query);

From halr at voltaire.com  Fri Nov  5 12:09:42 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 05 Nov 2004 15:09:42 -0500
Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query.
In-Reply-To: 
References: 
Message-ID: <1099685382.3278.56.camel@localhost.localdomain>

On Fri, 2004-11-05 at 14:57, Krishna Kumar wrote:
> (do you guys need patches to be rooted from a particular directory
> to be more efficient/convenient ?)

We should be patching against gen2/trunk now.

-- Hal

From halr at voltaire.com  Fri Nov  5 12:17:11 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 05 Nov 2004 15:17:11 -0500
Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: <418BD2AD.6030807@ichips.intel.com>
References: <1099680887.2965.18.camel@hpc-1> <418BD2AD.6030807@ichips.intel.com>
Message-ID: <1099685830.3278.65.camel@localhost.localdomain>

On Fri, 2004-11-05 at 14:21, Sean Hefty wrote:
> I think we should go ahead and put this code in. We need to handle the
> case where the QP is sized larger than what we request anyway, to ensure
> that we don't overrun the CQ.

Does the driver do this (QP is sized larger than what was requested) now
? Or is this a spec thing ?

If so, just to make sure I have this straight, what is/are the specific
patch(es) ? Is it the 2 patches from Wednesday entitled "PATCH 1/2
Resize CQ" and "PATCH 2/2 Implement error handling in resize failure".

Thanks.
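(To illustrate what "sized larger than requested" would look like to a
consumer -- a sketch, assuming the create verb reports the actual,
possibly rounded-up, capacities back through the cap fields as the spec
allows; check the verbs layer before relying on this:)

struct ib_qp_init_attr init_attr;	/* other fields (CQs, QP type,
					 * etc.) omitted from the sketch */
struct ib_qp *qp;

init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
qp = ib_create_qp(pd, &init_attr);
if (!IS_ERR(qp) &&
    (init_attr.cap.max_send_wr > IB_MAD_QP_SEND_SIZE ||
     init_attr.cap.max_recv_wr > IB_MAD_QP_RECV_SIZE)) {
	/* The QP came back deeper than requested: either resize the
	 * CQ to match or throttle posted WRs to the requested depth. */
}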
-- Hal From roland at topspin.com Fri Nov 5 12:22:09 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 12:22:09 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <1099685830.3278.65.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 05 Nov 2004 15:17:11 -0500") References: <1099680887.2965.18.camel@hpc-1> <418BD2AD.6030807@ichips.intel.com> <1099685830.3278.65.camel@localhost.localdomain> Message-ID: <52654khwr2.fsf@topspin.com> Hal> Does the driver do this (QP is sized larger than what was Hal> requested) now ? Or is this a spec thing ? Unless my memory is playing tricks on me, I don't think mthca will create a QP larger than requested. - R. From ftillier at infiniconsys.com Fri Nov 5 12:22:13 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 5 Nov 2004 12:22:13 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <418BD2AD.6030807@ichips.intel.com> Message-ID: <000001c4c375$26bbac40$655aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, November 05, 2004 11:21 AM > > Hal Rosenstock wrote: > >>>However I'm not sure I understand why the MAD layer wants to resize > >>>these objects -- given that the number of QPs is known in advance and > >>>that the MAD layer can choose how many work requests to post per QP, > >>>I'm not sure what is gained by trying to resize things dynamically. > >> > >>Actually, I haven't really implemented the "dynamically" part, where you > >>resize the CQ during operation. The spec said that when you create a QP, > >>it can be larger than what you specified. If so, I see good value in > >>increasing the size of the associated CQ, if it is supported by the > >>driver. > > > > > > Might this be useful for redirected QPs ? > > Since the client allocates the QP and CQ in this case, they would be > responsible for resizing the CQ appropriately. The MAD layer could > provide queuing to prevent send queue overflow, or not, depending on how > we want to implement it. If the MAD layer did provide queuing to prevent overflow for the requested (not allocated) depth, then the CQ resize is unnecessary. I would expect that whatever code manages the QP/CQ should provide queuing so that MAD agents don't all have to implement queueing with respect to one another. - Fab From ftillier at infiniconsys.com Fri Nov 5 12:22:13 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 5 Nov 2004 12:22:13 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <52is8khzm2.fsf@topspin.com> Message-ID: <000101c4c375$327c66a0$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, November 05, 2004 11:20 AM > > Hal> Is this a driver or firmware issue ? > > Driver issue. I just haven't implemented CQ resize yet, and it's not > a high priority for me. > As far as I know, Tavor does not support QP resize. - Fab From halr at voltaire.com Fri Nov 5 12:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 15:22:35 -0500 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <52ekj8hzkn.fsf@topspin.com> References: <52y8hhmrs7.fsf@topspin.com> <1099681012.2965.22.camel@hpc-1> <52ekj8hzkn.fsf@topspin.com> Message-ID: <1099686155.3278.71.camel@localhost.localdomain> On Fri, 2004-11-05 at 14:21, Roland Dreier wrote: > Hal> 1. Are there changes planned for core/cache.c ? 
>
> I've cleaned it up a little but I'm really not sure exactly what
> should be done with it.

One more thing would be to rename ts_ib_core.h to something like
ib_cache.h.

-- Hal

From mshefty at ichips.intel.com  Fri Nov  5 12:27:00 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 05 Nov 2004 12:27:00 -0800
Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: <000001c4c375$26bbac40$655aa8c0@infiniconsys.com>
References: <000001c4c375$26bbac40$655aa8c0@infiniconsys.com>
Message-ID: <418BE214.3070302@ichips.intel.com>

Fab Tillier wrote:
> If the MAD layer did provide queuing to prevent overflow for the requested
> (not allocated) depth, then the CQ resize is unnecessary. I would expect
> that whatever code manages the QP/CQ should provide queuing so that MAD
> agents don't all have to implement queueing with respect to one another.

Resizing the CQ is an optimization only. If the resize fails, the MAD
layer will simply restrict the number of outstanding sends/receives.
The MAD layer will queue sends, but not receives.

- Sean

From krkumar at us.ibm.com  Fri Nov  5 12:56:15 2004
From: krkumar at us.ibm.com (Krishna Kumar)
Date: Fri, 5 Nov 2004 12:56:15 -0800 (PST)
Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: <1099685830.3278.65.camel@localhost.localdomain>
Message-ID: 

Hi Hal,

I think others have answered it, but my 2 cents :

1. The current driver doesn't do this; it is a spec thing, and a potential
driver could do this in the future. It is useful in the sense that the CQ
is always at least the size of the QPs that are using it, in case the
driver implements it, as Sean mentioned.

2. The patches that you are referring to are correct. If people agree
that it is useful to have, I can regenerate against the latest bits
(that is, if it doesn't apply).

thanks,

- KK

On Fri, 5 Nov 2004, Hal Rosenstock wrote:

> On Fri, 2004-11-05 at 14:21, Sean Hefty wrote:
> > I think we should go ahead and put this code in. We need to handle the
> > case where the QP is sized larger than what we request anyway, to ensure
> > that we don't overrun the CQ.
>
> Does the driver do this (QP is sized larger than what was requested) now
> ? Or is this a spec thing ?
>
> If so, just to make sure I have this straight, what is/are the specific
> patch(es) ? Is it the 2 patches from Wednesday entitled "PATCH 1/2
> Resize CQ" and "PATCH 2/2 Implement error handling in resize failure".
>
> Thanks.
>
> -- Hal
>
>

From roland at topspin.com  Fri Nov  5 13:30:13 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 05 Nov 2004 13:30:13 -0800
Subject: Re: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 12:56:15 -0800 (PST)")
References: 
Message-ID: <521xf8htlm.fsf@topspin.com>

I guess my bottom line is that these patches add complexity and can't
be tested at the moment, so my inclination would be to leave them out.

- R.

From krkumar at us.ibm.com  Fri Nov  5 13:22:46 2004
From: krkumar at us.ibm.com (Krishna Kumar)
Date: Fri, 5 Nov 2004 13:22:46 -0800 (PST)
Subject: [openib-general] [PATCH] Extra kfrees, clean up unregisters, etc ...
Message-ID: 

1. Don't kfree sa_dev twice.

2. Unnecessary kref_put : in the failure case, we don't seem to have a
reference until update_sm_ah is called in the success case.

3. Clean up code which looks like a hack (i++ in failure).

4. Too many "i - s" computations; no need to keep recalculating this :-)

5.
Potential extra cleanup : I could have set index = 0 instead of index = i - s, I kept it this way to be quite identical to existing code. Patch applies with -p1 on trunk directory, on top of my previous patch. Thanks, - KK diff -ruNp trunk/src/linux-kernel/infiniband/core/sa_query.c.org trunk/src/linux-kernel/infiniband/core/sa_query.c --- trunk/src/linux-kernel/infiniband/core/sa_query.c.org 2004-11-05 11:51:06.000000000 -0800 +++ trunk/src/linux-kernel/infiniband/core/sa_query.c 2004-11-05 13:10:50.000000000 -0800 @@ -682,6 +682,7 @@ static void ib_sa_add_one(struct ib_devi { struct ib_sa_device *sa_dev; int s, e, i; + int index; if (device->node_type == IB_NODE_SWITCH) s = e = 0; @@ -703,29 +704,29 @@ static void ib_sa_add_one(struct ib_devi sa_dev->start_port = s; sa_dev->end_port = e; - for (i = s; i <= e; ++i) { - sa_dev->port[i - s].mr = NULL; - sa_dev->port[i - s].sm_ah = NULL; - sa_dev->port[i - s].port_num = i; - spin_lock_init(&sa_dev->port[i - s].ah_lock); + for (i = s, index = i - s; i <= e; ++i, ++index) { + sa_dev->port[index].mr = NULL; + sa_dev->port[index].sm_ah = NULL; + sa_dev->port[index].port_num = i; + spin_lock_init(&sa_dev->port[index].ah_lock); - sa_dev->port[i - s].agent = + sa_dev->port[index].agent = ib_register_mad_agent(device, i, IB_QPT_GSI, NULL, 0, send_handler, recv_handler, sa_dev); - if (IS_ERR(sa_dev->port[i - s].agent)) + if (IS_ERR(sa_dev->port[index].agent)) goto err; - sa_dev->port[i - s].mr = ib_get_dma_mr(sa_dev->port[i - s].agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(sa_dev->port[i - s].mr)) { - /* Bump i so agent from this iter. is freed */ - ++i; + sa_dev->port[index].mr = + ib_get_dma_mr(sa_dev->port[index].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[index].mr)) { + ib_unregister_mad_agent(sa_dev->port[index].agent); goto err; } - INIT_WORK(&sa_dev->port[i - s].update_task, - update_sm_ah, &sa_dev->port[i - s]); + INIT_WORK(&sa_dev->port[index].update_task, + update_sm_ah, &sa_dev->port[index]); } /* @@ -736,27 +737,20 @@ static void ib_sa_add_one(struct ib_devi */ INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); - if (ib_register_event_handler(&sa_dev->event_handler)) { - kfree(sa_dev); + if (ib_register_event_handler(&sa_dev->event_handler)) goto err; - } - for (i = s; i <= e; ++i) - update_sm_ah(&sa_dev->port[i - s]); + while (--index >= 0) + update_sm_ah(&sa_dev->port[index]); ib_set_client_data(device, &sa_client, sa_dev); return; err: - while (--i >= s) { - if (sa_dev->port[i - s].mr && !IS_ERR(sa_dev->port[i - s].mr)) - ib_dereg_mr(sa_dev->port[i - s].mr); - - if (sa_dev->port[i - s].sm_ah) - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); - - ib_unregister_mad_agent(sa_dev->port[i - s].agent); + while (--index >= 0) { + ib_dereg_mr(sa_dev->port[index].mr); + ib_unregister_mad_agent(sa_dev->port[index].agent); } kfree(sa_dev); From krkumar at us.ibm.com Fri Nov 5 13:48:56 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Fri, 5 Nov 2004 13:48:56 -0800 (PST) Subject: [openib-general] [PATCH] mad doesn't get freed up after send_mad is called Message-ID: Applies on top of my previous patch... 
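(Background for the leak: the query setup path makes two separate
allocations, so both must be freed on every exit path. A simplified
sketch of the pattern, not the literal sa_query.c code:)

/* simplified from the ib_sa_path_rec_get() setup path */
struct ib_sa_path_query *query;

query = kmalloc(sizeof *query, GFP_KERNEL);
if (!query)
	return -ENOMEM;

query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, GFP_KERNEL);
if (!query->sa_query.mad) {
	kfree(query);		/* don't leak the wrapper either */
	return -ENOMEM;
}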
diff -ruNp trunk/src/linux-kernel/infiniband/core/sa_query.c.org trunk/src/linux-kernel/infiniband/core/sa_query.c --- trunk/src/linux-kernel/infiniband/core/sa_query.c.org 2004-11-05 13:13:12.000000000 -0800 +++ trunk/src/linux-kernel/infiniband/core/sa_query.c 2004-11-05 13:43:10.000000000 -0800 @@ -550,6 +550,7 @@ int ib_sa_path_rec_get(struct ib_device ret = query->sa_query.id; *sa_query = &query->sa_query; } + kfree(query->sa_query.mad); kfree(query); return ret; } @@ -624,6 +625,7 @@ int ib_sa_mcmember_rec_query(struct ib_d ret = query->sa_query.id; *sa_query = &query->sa_query; } + kfree(query->sa_query.mad); kfree(query); return ret; From krkumar at us.ibm.com Fri Nov 5 14:01:22 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Fri, 5 Nov 2004 14:01:22 -0800 (PST) Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c Message-ID: diff -ruNp trunk/src/linux-kernel/infiniband/core/sa_query.c.org trunk/src/linux-kernel/infiniband/core/sa_query.c --- trunk/src/linux-kernel/infiniband/core/sa_query.c.org 2004-11-05 13:43:10.000000000 -0800 +++ trunk/src/linux-kernel/infiniband/core/sa_query.c 2004-11-05 13:58:47.000000000 -0800 @@ -387,18 +387,21 @@ static void ib_sa_event(struct ib_event_ } } -void ib_sa_cancel_query(int id, struct ib_sa_query *query) +static inline struct ib_sa_query *ib_sa_find_idr(int id) { - unsigned long flags; + struct ib_sa_query *query + unsigned long flags; spin_lock_irqsave(&idr_lock, flags); - if (idr_find(&query_idr, query->id) != query) { - spin_unlock_irqrestore(&idr_lock, flags); - return; - } + query = idr_find(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); + return query; +} - ib_cancel_mad(query->port->agent, query->id); +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + if (ib_sa_find_idr(id) == query) + ib_cancel_mad(query->port->agent, query->id); } EXPORT_SYMBOL(ib_sa_cancel_query); @@ -638,10 +641,7 @@ static void send_handler(struct ib_mad_a struct ib_sa_query *query; unsigned long flags; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_send_wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); - + query = ib_sa_find_idr(mad_send_wc->wr_id); if (!query) return; @@ -661,12 +661,8 @@ static void recv_handler(struct ib_mad_a struct ib_mad_recv_wc *mad_recv_wc) { struct ib_sa_query *query; - unsigned long flags; - - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); + query = ib_sa_find_idr(mad_recv_wc->wc->wr_id); if (query) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, From mshefty at ichips.intel.com Fri Nov 5 15:14:09 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 05 Nov 2004 15:14:09 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <1099680579.2965.7.camel@hpc-1> References: <1099680579.2965.7.camel@hpc-1> Message-ID: <418C0941.2080904@ichips.intel.com> Hal Rosenstock wrote: > mad: Handle outgoing SMPs in ib_post_send_mad > The MAD layer is now ready to support the SM :-) > > I have not yet handled the additional special cases: hop count increment > done by device, use send queue rather than process MAD for 0 hop SMPs). Hal, can you check that your code stays within 80 characters per line? 
- Sean From roland at topspin.com Fri Nov 5 18:44:36 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 18:44:36 -0800 Subject: [openib-general] [PATCH] Extra kfrees, clean up unregisters, etc ... In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 13:22:46 -0800 (PST)") References: Message-ID: <52fz3nhf1n.fsf@topspin.com> Thanks for the audit. I applied this version of your patch. Does this still look correct? Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1164) +++ infiniband/core/sa_query.c (working copy) @@ -699,29 +710,28 @@ sa_dev->start_port = s; sa_dev->end_port = e; - for (i = s; i <= e; ++i) { - sa_dev->port[i - s].mr = NULL; - sa_dev->port[i - s].sm_ah = NULL; - sa_dev->port[i - s].port_num = i; - spin_lock_init(&sa_dev->port[i - s].ah_lock); + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); - sa_dev->port[i - s].agent = - ib_register_mad_agent(device, i, IB_QPT_GSI, + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, NULL, 0, send_handler, recv_handler, sa_dev); - if (IS_ERR(sa_dev->port[i - s].agent)) + if (IS_ERR(sa_dev->port[i].agent)) goto err; - sa_dev->port[i - s].mr = ib_get_dma_mr(sa_dev->port[i - s].agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(sa_dev->port[i - s].mr)) { - /* Bump i so agent from this iter. is freed */ - ++i; + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); goto err; } - INIT_WORK(&sa_dev->port[i - s].update_task, - update_sm_ah, &sa_dev->port[i - s]); + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); } /* @@ -732,27 +742,20 @@ */ INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); - if (ib_register_event_handler(&sa_dev->event_handler)) { - kfree(sa_dev); + if (ib_register_event_handler(&sa_dev->event_handler)) goto err; - } - for (i = s; i <= e; ++i) - update_sm_ah(&sa_dev->port[i - s]); + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); ib_set_client_data(device, &sa_client, sa_dev); return; err: - while (--i >= s) { - if (sa_dev->port[i - s].mr && !IS_ERR(sa_dev->port[i - s].mr)) - ib_dereg_mr(sa_dev->port[i - s].mr); - - if (sa_dev->port[i - s].sm_ah) - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); - - ib_unregister_mad_agent(sa_dev->port[i - s].agent); + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); } kfree(sa_dev); From roland at topspin.com Fri Nov 5 19:09:18 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 19:09:18 -0800 Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 11:57:53 -0800 (PST)") References: Message-ID: <52brebhdwh.fsf@topspin.com> Sorry, this and the follow-up patch are wrong. The if the send succeeds then we can't free the query structure until the query finishes up. (The query will be freed in the appropriate ->release method in this case). You are right that there is a memory leak though. 
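(To spell out the constraint: once send_mad() succeeds, the MAD layer
still owns the buffer until the completion or timeout path calls back
into sa_query.c, so freeing in the ib_sa_*_get() caller would be a
use-after-free. A rough sketch of the lifetime, not literal code:)

ret = send_mad(&query->sa_query, timeout_ms);
if (!ret) {
	/* Ownership has passed to the MAD layer.  Later, from the
	 * send completion (or timeout), the handler runs roughly:
	 *
	 *	query->callback(query, ...);
	 *	query->release(query);	/- the only place to kfree -/
	 */
}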
I fixed it like this: Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1166) +++ infiniband/core/sa_query.c (working copy) @@ -500,6 +500,7 @@ static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) { + kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } @@ -544,11 +545,12 @@ rec, query->sa_query.mad->data); ret = send_mad(&query->sa_query, timeout_ms); - if (ret) + if (ret) { + kfree(query->sa_query.mad); kfree(query); + } else + *sa_query = &query->sa_query; - *sa_query = &query->sa_query; - return ret ? ret : query->sa_query.id; } EXPORT_SYMBOL(ib_sa_path_rec_get); @@ -572,6 +574,7 @@ static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) { + kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } @@ -617,11 +620,12 @@ rec, query->sa_query.mad->data); ret = send_mad(&query->sa_query, timeout_ms); - if (ret) + if (ret) { + kfree(query->sa_query.mad); kfree(query); + } else + *sa_query = &query->sa_query; - *sa_query = &query->sa_query; - return ret ? ret : query->sa_query.id; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); From roland at topspin.com Fri Nov 5 19:10:11 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 19:10:11 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 14:01:22 -0800 (PST)") References: Message-ID: <527jozhdv0.fsf@topspin.com> Thanks but I'm not going to apply this. I prefer to have the locking and the idr lookup be explicit (and it's only done in two places so the cleanup is pretty minimal). Thanks, Roland From roland at topspin.com Fri Nov 5 19:10:55 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 19:10:55 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <418C0941.2080904@ichips.intel.com> (Sean Hefty's message of "Fri, 05 Nov 2004 15:14:09 -0800") References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> Message-ID: <523bznhdts.fsf@topspin.com> Sean> Hal, can you check that your code stays within 80 characters Sean> per line? The 80 character limit is really just a guideline. It's not worth going through contortions to fix an 85-character line. - Roland From mshefty at ichips.intel.com Fri Nov 5 19:56:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 05 Nov 2004 19:56:27 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <523bznhdts.fsf@topspin.com> References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> <523bznhdts.fsf@topspin.com> Message-ID: <418C4B6B.1030108@ichips.intel.com> Roland Dreier wrote: > Sean> Hal, can you check that your code stays within 80 characters > Sean> per line? > > The 80 character limit is really just a guideline. It's not worth > going through contortions to fix an 85-character line. Okay. I was just going by the coding style documentation that mentioned that this was a "hard limit". If it's not that big of a deal, then I'll only worry about excessively long lines. 
- Sean

From roland at topspin.com  Fri Nov  5 21:23:13 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 05 Nov 2004 21:23:13 -0800
Subject: Re: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad
In-Reply-To: <418C4B6B.1030108@ichips.intel.com> (Sean Hefty's message of "Fri, 05 Nov 2004 19:56:27 -0800")
References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> <523bznhdts.fsf@topspin.com> <418C4B6B.1030108@ichips.intel.com>
Message-ID: <52y8hfft4u.fsf@topspin.com>

Sean> Okay. I was just going by the coding style documentation
Sean> that mentioned that this was a "hard limit". If it's not
Sean> that big of a deal, then I'll only worry about excessively
Sean> long lines.

Yeah, if you read through the kernel source, you can find tons and
tons of lines somewhat longer than 80 characters. In fact just now I
was noticing gems like the 125-character line

    struct class_device *class_simple_device_add(struct class_simple *cs, dev_t dev, struct device *device, const char *fmt, ...)

in drivers/base/class_simple.c... maybe a better example of the right
way to do things is a line like

    if (tp->link_config.advertising & ADVERTISED_1000baseT_Half)

from drivers/net/tg3.c, which ends in column 83 but looks fine.

- R.

From roland at topspin.com  Fri Nov  5 21:39:38 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 05 Nov 2004 21:39:38 -0800
Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct
Message-ID: <52u0s3fsdh.fsf@topspin.com>

It seems that there are lots of places where consumers need to
allocate an entire ib_device_attr struct and deal with the possibility
that ib_query_device() might fail, just to find out how many ports a
device has.

We discussed this before and concluded that it was OK to assume that
the number of physical ports is constant. This patch simplifies a lot
of code by making phys_port_cnt a field in struct ib_device, so
consumers can just read the value when they need it.

Does this look good to commit?
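(The consumer-side effect in miniature -- an illustrative sketch, not a
hunk from the patch itself:)

int num_ports;

/* before: an allocation and a failure path just to learn a constant */
struct ib_device_attr attr;

if (ib_query_device(device, &attr))
	return;
num_ports = attr.phys_port_cnt;

/* after: just read the field */
num_ports = device->phys_port_cnt;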
Thanks, Roland Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1167) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -821,19 +821,12 @@ static void ipoib_add_one(struct ib_device *device) { - struct ib_device_attr props; int port; - if (ib_query_device(device, &props)) { - printk(KERN_WARNING "%s: ib_device_properties_get failed\n", - device->name); - return; - } - if (device->node_type == IB_NODE_SWITCH) ipoib_add_port("ib%d", device, 0); else - for (port = 1; port <= props.phys_port_cnt; ++port) + for (port = 1; port <= device->phys_port_cnt; ++port) ipoib_add_port("ib%d", device, port); } Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1167) +++ infiniband/include/ib_verbs.h (working copy) @@ -91,7 +91,6 @@ int max_cqe; int max_mr; int max_pd; - int phys_port_cnt; int max_qp_rd_atom; int max_ee_rd_atom; int max_res_rd_atom; @@ -794,6 +793,7 @@ } reg_state; u8 node_type; + u8 phys_port_cnt; }; struct ib_client { Index: infiniband/core/device.c =================================================================== --- infiniband/core/device.c (revision 1167) +++ infiniband/core/device.c (working copy) @@ -191,7 +191,6 @@ int ib_register_device(struct ib_device *device) { struct ib_device_private *priv; - struct ib_device_attr prop; int ret; down(&device_sem); @@ -217,18 +216,11 @@ *priv = (struct ib_device_private) { 0 }; - ret = device->query_device(device, &prop); - if (ret) { - printk(KERN_WARNING "query_device failed for %s\n", - device->name); - goto out_free; - } - if (device->node_type == IB_NODE_SWITCH) { priv->start_port = priv->end_port = 0; } else { priv->start_port = 1; - priv->end_port = prop.phys_port_cnt; + priv->end_port = device->phys_port_cnt; } priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), @@ -236,6 +228,7 @@ if (!priv->port_data) { printk(KERN_WARNING "Couldn't allocate port info for %s\n", device->name); + ret = -ENOMEM; goto out_free; } @@ -253,7 +246,8 @@ goto out_free_port; } - if (ib_device_register_sysfs(device)) { + ret = ib_device_register_sysfs(device); + if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); goto out_free_cache; Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1167) +++ infiniband/core/user_mad.c (working copy) @@ -489,12 +489,8 @@ if (device->node_type == IB_NODE_SWITCH) s = e = 0; else { - struct ib_device_attr attr; - if (ib_query_device(device, &attr)) - return; - s = 1; - e = attr.phys_port_cnt; + e = device->phys_port_cnt; } umad_dev = kmalloc(sizeof *umad_dev + Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 1167) +++ infiniband/core/mad.c (working copy) @@ -244,7 +244,6 @@ mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; - mad_agent_priv->phys_port_cnt = port_priv->phys_port_cnt; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -418,7 +417,7 @@ if (!smi_handle_dr_smp_recv((struct ib_smp *)&mad_priv->mad, mad_agent->device->node_type, mad_agent->port_num, - mad_agent_priv->phys_port_cnt)) { + 
mad_agent->device->phys_port_cnt)) { ret = -EINVAL; kmem_cache_free(ib_mad_cache, mad_priv); goto error1; @@ -1085,7 +1084,7 @@ if (!smi_handle_dr_smp_recv(smp, port_priv->device->node_type, port_priv->port_num, - port_priv->phys_port_cnt)) + port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) goto out; @@ -1125,7 +1124,7 @@ (struct ib_smp *)response, port_priv->device->node_type, port_priv->port_num, - port_priv->phys_port_cnt)) { + port_priv->device->phys_port_cnt)) { kfree(response); goto out; } @@ -1842,8 +1841,7 @@ * Create the QP, PD, MR, and CQ if needed */ static int ib_mad_port_open(struct ib_device *device, - int port_num, - int num_ports) + int port_num) { int ret, cq_size; u64 iova = 0; @@ -1872,7 +1870,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; - port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1985,29 +1982,22 @@ static void ib_mad_init_device(struct ib_device *device) { int ret, num_ports, cur_port, i, ret2; - struct ib_device_attr device_attr; - ret = ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port, num_ports); + ret = ib_mad_port_open(device, cur_port); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); goto error_device_open; } - ret = ib_agent_port_open(device, cur_port, num_ports); + ret = ib_agent_port_open(device, cur_port); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d for agents\n", device->name, cur_port); @@ -2039,20 +2029,13 @@ static void ib_mad_remove_device(struct ib_device *device) { - int ret, i, num_ports, cur_port, ret2; - struct ib_device_attr device_attr; + int ret = 0, i, num_ports, cur_port, ret2; - ret = ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { @@ -2071,9 +2054,6 @@ ret = ret2; } } - -error_device_query: - return; } static struct ib_client mad_client = { Index: infiniband/core/agent.h =================================================================== --- infiniband/core/agent.h (revision 1167) +++ infiniband/core/agent.h (working copy) @@ -27,8 +27,7 @@ #define __AGENT_H_ extern int ib_agent_port_open(struct ib_device *device, - int port_num, - int phys_port_cnt); + int port_num); extern int ib_agent_port_close(struct ib_device *device, int port_num); Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 1167) +++ infiniband/core/sysfs.c (working copy) @@ -640,14 +640,9 @@ if (ret) goto err_put; } else { - struct ib_device_attr attr; int i; - ret = ib_query_device(device, &attr); - if (ret) - goto err_put; - - for (i = 1; i <= attr.phys_port_cnt; ++i) { + for (i = 1; i <= device->phys_port_cnt; ++i) { ret = add_port(device, i); if (ret) goto err_put; Index: 
infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1167) +++ infiniband/core/sa_query.c (working copy) @@ -696,12 +696,8 @@ if (device->node_type == IB_NODE_SWITCH) s = e = 0; else { - struct ib_device_attr attr; - if (ib_query_device(device, &attr)) - return; - s = 1; - e = attr.phys_port_cnt; + e = device->phys_port_cnt; } sa_dev = kmalloc(sizeof *sa_dev + Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1167) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -47,7 +47,6 @@ if (!in_mad || !out_mad) goto out; - props->phys_port_cnt = to_mdev(ibdev)->limits.num_ports; props->fw_ver = to_mdev(ibdev)->fw_ver; memset(in_mad, 0, sizeof *in_mad); @@ -573,6 +572,7 @@ strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = dev->pdev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; From halr at voltaire.com Fri Nov 5 21:51:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 00:51:35 -0500 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <52u0s3fsdh.fsf@topspin.com> References: <52u0s3fsdh.fsf@topspin.com> Message-ID: <1099720295.3278.80.camel@localhost.localdomain> On Sat, 2004-11-06 at 00:39, Roland Dreier wrote: > It seems that there are lots of places where consumers need to > allocate an entire ib_device_attr struct and deal with the possibility > that ib_query_device() might fail, just to find out how many ports a > device has. Yes, I noticed this when I added yet another case of this earlier today/yesterday. You beat me to this :-) > We discussed this before and concluded that it was OK to > assume that the number of physical ports is constant. Agreed. > This patch simplifies a lot of code by making phys_port_cnt a field in > struct ib_device, so consumers can just read the value when they need it. > > Does this look good to commit? Looks good. Do you want me to try it (tomorrow/today depending on your time zone before committing it) ? 
-- Hal

From halr at voltaire.com  Fri Nov  5 22:04:50 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 06 Nov 2004 01:04:50 -0500
Subject: [openib-general] [PATCH] [TRIVIAL] Remove unused variable from ipoib_multicast.c
Message-ID: <1099721089.14986.1.camel@hpc-1>

Remove unused variable from ipoib_multicast.c

Index: ipoib_multicast.c
===================================================================
--- ipoib_multicast.c	(revision 1167)
+++ ipoib_multicast.c	(working copy)
@@ -259,7 +259,6 @@
 {
 	struct ipoib_mcast *mcast = mcast_ptr;
 	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	if (!status)
 		ipoib_mcast_join_finish(mcast, mcmember);

From halr at voltaire.com  Sat Nov  6 04:00:15 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 06 Nov 2004 07:00:15 -0500
Subject: [openib-general] [PATCH] [TRIVIAL] Remove unused variable (not debug) from ipoib_multicast.c
Message-ID: <1099742414.17534.1.camel@hpc-1>

Remove unused variable (not debug) from ipoib_multicast.c
(This supersedes the previous version)

Index: ipoib_multicast.c
===================================================================
--- ipoib_multicast.c	(revision 1167)
+++ ipoib_multicast.c	(working copy)
@@ -259,7 +259,9 @@
 {
 	struct ipoib_mcast *mcast = mcast_ptr;
 	struct net_device *dev = mcast->dev;
+#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+#endif
 
 	if (!status)
 		ipoib_mcast_join_finish(mcast, mcmember);

From roland at topspin.com  Sat Nov  6 08:43:40 2004
From: roland at topspin.com (Roland Dreier)
Date: Sat, 06 Nov 2004 08:43:40 -0800
Subject: Re: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct
In-Reply-To: <1099720295.3278.80.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 06 Nov 2004 00:51:35 -0500")
References: <52u0s3fsdh.fsf@topspin.com> <1099720295.3278.80.camel@localhost.localdomain>
Message-ID: <52lldfexmr.fsf@topspin.com>

Hal> Looks good. Do you want me to try it (tomorrow/today
Hal> depending on your time zone before committing it) ?

Sure, I'm happy to wait.

- Roland

From roland at topspin.com  Sat Nov  6 08:44:56 2004
From: roland at topspin.com (Roland Dreier)
Date: Sat, 06 Nov 2004 08:44:56 -0800
Subject: Re: [openib-general] [PATCH] [TRIVIAL] Remove unused variable (not debug) from ipoib_multicast.c
In-Reply-To: <1099742414.17534.1.camel@hpc-1> (Hal Rosenstock's message of "Sat, 06 Nov 2004 07:00:15 -0500")
References: <1099742414.17534.1.camel@hpc-1>
Message-ID: <52hdo3exkn.fsf@topspin.com>

Hal> Remove unused variable (not debug) from ipoib_multicast.c

Thanks for pointing this out. I fixed it like this rather than adding
another #ifdef...
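(For context: the variable is only unused in non-debug builds because the
mcast debug macro then compiles away and nothing reads 'priv'. Roughly
the following -- a sketch of the ipoib.h arrangement, not a verbatim
quote:)

#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
#define ipoib_dbg_mcast(priv, format, arg...) \
	ipoib_printk(KERN_DEBUG, priv, format , ## arg)
#else
#define ipoib_dbg_mcast(priv, format, arg...) \
	do { } while (0)	/* 'priv' never referenced */
#endif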
Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1167) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -259,14 +259,13 @@ { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) ipoib_mcast_join_finish(mcast, mcmember); else { if (mcast->logcount++ < 20) - ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT - ", status %d\n", + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1167) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -228,7 +228,7 @@ #define ipoib_printk(level, priv, format, arg...) \ - printk(level "%s: " format, (priv)->dev->name , ## arg) + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) #define ipoib_warn(priv, format, arg...) \ ipoib_printk(KERN_WARNING, priv, format , ## arg) Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1167) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -461,12 +461,8 @@ /*..ipoib_ib_dev_cleanup -- clean up IB resources for iface */ void ipoib_ib_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = netdev_priv(dev); + ipoib_dbg(netdev_priv(dev), "cleaning up ib_dev\n"); - /* Avoid unused warning if DEBUG is off */ - (void) priv; - ipoib_dbg(priv, "cleaning up ib_dev\n"); - ipoib_mcast_stop_thread(dev); /* Delete the broadcast address and the local address */ From halr at voltaire.com Sat Nov 6 10:14:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:14:29 -0500 Subject: [openib-general] IPoIB Multicast Message-ID: <1099764868.20222.8.camel@hpc-1> Hi Roland, The IB multicast support appears to be much better :-) I have not been able to recreate (at least as yet) any of the ifdown or modprobe -r issues I used to see. I will keep an eye on this and report back if this changes. I have found two minor anomalies/questions which do not cause any operational issues: 1. If you down the interface and bring it back up, the second time up, there are 2 identical join requests for the broadcast group rather than just 1. These 2 come out very close to one another (217 usec apart). Is there some counting issue that is causing this ? 2. When leaving an IP multicast group, there appears to be an extra join to 0x16 (something like 224.0.0.22 which would be for IGMP). Any ideas on this ? Thanks. -- Hal From halr at voltaire.com Sat Nov 6 10:34:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:34:57 -0500 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <52lldfexmr.fsf@topspin.com> References: <52u0s3fsdh.fsf@topspin.com> <1099720295.3278.80.camel@localhost.localdomain> <52lldfexmr.fsf@topspin.com> Message-ID: <1099766097.20222.23.camel@hpc-1> On Sat, 2004-11-06 at 11:43, Roland Dreier wrote: > Hal> Looks good. Do you want me to try it (tomorrow/today > Hal> depending on your time zone before committing it) ? > > Sure, I'm happy to wait. Works for me. 
-- Hal From halr at voltaire.com Sat Nov 6 10:42:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:42:20 -0500 Subject: [openib-general] Latest IPoIB Bringup Questions In-Reply-To: <52654u1vwg.fsf@topspin.com> References: <1098985903.17991.74.camel@hpc-1> <52654u1vwg.fsf@topspin.com> Message-ID: <1099766540.20222.29.camel@hpc-1> On Thu, 2004-10-28 at 15:32, Roland Dreier wrote: > Probably better to work on ip, since ifconfig has other issues (such > as using an ioctl limited to 14 bytes to get the HW addr) I presume this is the same issue for arp (e.g. arp -a). So how do we go about getting this increased in the 2.6 kernel ? Is 20 bytes sufficient ? Should this be part of our 2.6 diffs as well ? -- Hal From halr at voltaire.com Sat Nov 6 10:49:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:49:17 -0500 Subject: [Fwd: [openib-general] ifconfig ib0 down and then up vis a vis IP connectivity] Message-ID: <1099766957.20222.34.camel@hpc-1> I have confirmed that this is an ARP cache issue on the remote machine. The remote node is responding to the old DQPN of the machine whose IPoIB interface was downed and then brought up. When it is brought up again, it has a different QPN. The remote node still has the old QPN cached until it times out, and it sends to the old QPN, which is discarded on the local node. It behaves just like a hardware address change for an IP address with which a remote node had previously communicated (and whose MAC address it had cached). -- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] ifconfig ib0 down and then up vis a vis IP connectivity Date: 02 Nov 2004 14:45:03 -0500 Hi, What is the ARP timeout in Linux ? If I down and then up the ib0 interface, there is some delay before connectivity is restored despite the fact that it is successfully (re)attached to the multicast groups and that all the QPNs seem to be the same. After some time period, connectivity is restored. Any idea on what is different ? It seems like it is an ARP cache issue. Thanks. -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Sat Nov 6 11:26:30 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 06 Nov 2004 11:26:30 -0800 Subject: [openib-general] Latest IPoIB Bringup Questions In-Reply-To: <1099766540.20222.29.camel@hpc-1> (Hal Rosenstock's message of "Sat, 06 Nov 2004 13:42:20 -0500") References: <1098985903.17991.74.camel@hpc-1> <52654u1vwg.fsf@topspin.com> <1099766540.20222.29.camel@hpc-1> Message-ID: <52d5yqg4nt.fsf@topspin.com> Hal> So how do we go about getting this increased in the 2.6 Hal> kernel ? Is 20 bytes sufficient ? Should this be part of our Hal> 2.6 diffs as well ? The kernel has no problem (I had MAX_ADDR_LEN increased to 32 about 2 years ago). Just use the ip tool instead of ifconfig and arp (e.g. "ip neigh" or "ip addr"). - R. From roland at topspin.com Sat Nov 6 11:27:53 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 06 Nov 2004 11:27:53 -0800 Subject: [openib-general] IPoIB Multicast In-Reply-To: <1099764868.20222.8.camel@hpc-1> (Hal Rosenstock's message of "Sat, 06 Nov 2004 13:14:29 -0500") References: <1099764868.20222.8.camel@hpc-1> Message-ID: <528y9eg4li.fsf@topspin.com> Hal> 1. 
If you down the interface and bring it back up, the second Hal> time up, there are 2 identical join requests for the Hal> broadcast group rather than just 1. These 2 come out very Hal> close to one another (217 usec apart). Is there some counting Hal> issue that is causing this ? Hal> 2. When leaving an IP multicast group, there appears to be an Hal> extra join to 0x16 (something like 224.0.0.22 which would be Hal> for IGMP). Any ideas on this ? If you or someone else doesn't debug these issues first, I'll take a look at the code. - R. From halr at voltaire.com Sat Nov 6 11:48:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 14:48:38 -0500 Subject: [openib-general] IPoIB Multicast In-Reply-To: <528y9eg4li.fsf@topspin.com> References: <1099764868.20222.8.camel@hpc-1> <528y9eg4li.fsf@topspin.com> Message-ID: <1099770518.3281.3.camel@localhost.localdomain> On Sat, 2004-11-06 at 14:27, Roland Dreier wrote: > Hal> 1. If you down the interface and bring it back up, the second > Hal> time up, there are 2 identical join requests for the > Hal> broadcast group rather than just 1. These 2 come out very > Hal> close to one another (217 usec apart). Is there some counting > Hal> issue that is causing this ? > > Hal> 2. When leaving an IP multicast group, there appears to be an > Hal> extra join to 0x16 (something like 224.0.0.22 which would be > Hal> for IGMP). Any ideas on this ? > > If you or someone else doesn't debug these issues first, I'll take a > look at the code. I'll take a first crack and look at the code to see what I can determine. On the second issue, I partially understand what is going on: IPmc group changes need to be reported via IGMP so the IPmc router knows to prune the multicast tree, but... first, I don't understand why it joins here (and not earlier, when an IPmc group is first joined by this node); and second, after the join is successful, I do not see any IGMP packet come out of the node (onto IB; maybe it is going out the ethernet instead). -- Hal From mshefty at ichips.intel.com Mon Nov 8 08:48:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 08:48:43 -0800 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <52u0s3fsdh.fsf@topspin.com> References: <52u0s3fsdh.fsf@topspin.com> Message-ID: <418FA36B.8010402@ichips.intel.com> Roland Dreier wrote: > This patch simplifies a lot of code by making phys_port_cnt a field in > struct ib_device, so consumers can just read the value when they need it. > > Does this look good to commit? Looks good to me. - Sean From iod00d at hp.com Mon Nov 8 08:55:55 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 8 Nov 2004 08:55:55 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <418C4B6B.1030108@ichips.intel.com> References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> <523bznhdts.fsf@topspin.com> <418C4B6B.1030108@ichips.intel.com> Message-ID: <20041108165555.GD14706@cup.hp.com> On Fri, Nov 05, 2004 at 07:56:27PM -0800, Sean Hefty wrote: > Roland Dreier wrote: > >The 80 character limit is really just a guideline. It's not worth > >going through contortions to fix an 85-character line. > > Okay. I was just going by the coding style documentation that mentioned > that this was a "hard limit". If it's not that big of a deal, then I'll > only worry about excessively long lines. I don't take it as a hard limit either. But I rarely write code that exceeds 80 columns. 
And I expect someone will complain when gen2 is submitted to LKML if more than a few lines are longer than 80 columns. grant From roland at topspin.com Mon Nov 8 09:10:31 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 09:10:31 -0800 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <418FA36B.8010402@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Nov 2004 08:48:43 -0800") References: <52u0s3fsdh.fsf@topspin.com> <418FA36B.8010402@ichips.intel.com> Message-ID: <523bzke06w.fsf@topspin.com> Cool, I've committed this. - R. From krkumar at us.ibm.com Mon Nov 8 10:45:08 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 10:45:08 -0800 (PST) Subject: [openib-general] [PATCH] Extra kfrees, clean up unregisters, etc ... In-Reply-To: <52fz3nhf1n.fsf@topspin.com> Message-ID: Yes, looks good. - KK On Fri, 5 Nov 2004, Roland Dreier wrote: > Thanks for the audit. I applied this version of your patch. Does > this still look correct? > > Index: infiniband/core/sa_query.c > =================================================================== > --- infiniband/core/sa_query.c (revision 1164) > +++ infiniband/core/sa_query.c (working copy) > @@ -699,29 +710,28 @@ > sa_dev->start_port = s; > sa_dev->end_port = e; > > - for (i = s; i <= e; ++i) { > - sa_dev->port[i - s].mr = NULL; > - sa_dev->port[i - s].sm_ah = NULL; > - sa_dev->port[i - s].port_num = i; > - spin_lock_init(&sa_dev->port[i - s].ah_lock); > + for (i = 0; i <= e - s; ++i) { > + sa_dev->port[i].mr = NULL; > + sa_dev->port[i].sm_ah = NULL; > + sa_dev->port[i].port_num = i + s; > + spin_lock_init(&sa_dev->port[i].ah_lock); > > - sa_dev->port[i - s].agent = > - ib_register_mad_agent(device, i, IB_QPT_GSI, > + sa_dev->port[i].agent = > + ib_register_mad_agent(device, i + s, IB_QPT_GSI, > NULL, 0, send_handler, > recv_handler, sa_dev); > - if (IS_ERR(sa_dev->port[i - s].agent)) > + if (IS_ERR(sa_dev->port[i].agent)) > goto err; > > - sa_dev->port[i - s].mr = ib_get_dma_mr(sa_dev->port[i - s].agent->qp->pd, > - IB_ACCESS_LOCAL_WRITE); > - if (IS_ERR(sa_dev->port[i - s].mr)) { > - /* Bump i so agent from this iter. 
is freed */ > - ++i; > + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, > + IB_ACCESS_LOCAL_WRITE); > + if (IS_ERR(sa_dev->port[i].mr)) { > + ib_unregister_mad_agent(sa_dev->port[i].agent); > goto err; > } > > - INIT_WORK(&sa_dev->port[i - s].update_task, > - update_sm_ah, &sa_dev->port[i - s]); > + INIT_WORK(&sa_dev->port[i].update_task, > + update_sm_ah, &sa_dev->port[i]); > } > > /* > @@ -732,27 +742,20 @@ > */ > > INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); > - if (ib_register_event_handler(&sa_dev->event_handler)) { > - kfree(sa_dev); > + if (ib_register_event_handler(&sa_dev->event_handler)) > goto err; > - } > > - for (i = s; i <= e; ++i) > - update_sm_ah(&sa_dev->port[i - s]); > + for (i = 0; i <= e - s; ++i) > + update_sm_ah(&sa_dev->port[i]); > > ib_set_client_data(device, &sa_client, sa_dev); > > return; > > err: > - while (--i >= s) { > - if (sa_dev->port[i - s].mr && !IS_ERR(sa_dev->port[i - s].mr)) > - ib_dereg_mr(sa_dev->port[i - s].mr); > - > - if (sa_dev->port[i - s].sm_ah) > - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); > - > - ib_unregister_mad_agent(sa_dev->port[i - s].agent); > + while (--i >= 0) { > + ib_dereg_mr(sa_dev->port[i].mr); > + ib_unregister_mad_agent(sa_dev->port[i].agent); > } > > kfree(sa_dev); > > From roland at topspin.com Mon Nov 8 10:55:22 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 10:55:22 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr Message-ID: <52654gcgrp.fsf@topspin.com> Convert mad.c and agent.c to use ib_get_dma_mr() instead of ib_reg_phys_mr(). This is simpler and is actually required on platforms such as sparc64 where DMA addresses may not match up with physical RAM addresses. OK to commit? - Roland Index: core/agent.c =================================================================== --- core/agent.c (revision 1172) +++ core/agent.c (working copy) @@ -281,15 +281,9 @@ kfree(agent_send_wr); } -int ib_agent_port_open(struct ib_device *device, int port_num, - int phys_port_cnt) +int ib_agent_port_open(struct ib_device *device, int port_num) { int ret; - u64 iova = 0; - struct ib_phys_buf buf_list = { - .addr = 0, - .size = (unsigned long) high_memory - PAGE_OFFSET - }; struct ib_agent_port_private *port_priv; struct ib_mad_reg_req reg_req; unsigned long flags; @@ -312,7 +306,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->port_num = port_num; - port_priv->phys_port_cnt = phys_port_cnt; port_priv->wr_id = 0; spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->send_posted_list); @@ -356,9 +349,8 @@ goto error4; } - port_priv->mr = ib_reg_phys_mr(port_priv->dr_smp_agent->qp->pd, - &buf_list, 1, - IB_ACCESS_LOCAL_WRITE, &iova); + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); if (IS_ERR(port_priv->mr)) { printk(KERN_ERR SPFX "Couldn't register MR\n"); ret = PTR_ERR(port_priv->mr); Index: core/mad.c =================================================================== --- core/mad.c (revision 1172) +++ core/mad.c (working copy) @@ -1844,11 +1844,6 @@ int port_num) { int ret, cq_size; - u64 iova = 0; - struct ib_phys_buf buf_list = { - .addr = 0, - .size = (unsigned long) high_memory - PAGE_OFFSET - }; struct ib_mad_port_private *port_priv; unsigned long flags; @@ -1890,8 +1885,7 @@ goto error4; } - port_priv->mr = ib_reg_phys_mr(port_priv->pd, &buf_list, 1, - IB_ACCESS_LOCAL_WRITE, &iova); + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); if 
(IS_ERR(port_priv->mr)) { printk(KERN_ERR PFX "Couldn't register ib_mad MR\n"); ret = PTR_ERR(port_priv->mr); From tduffy at sun.com Mon Nov 8 10:57:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 08 Nov 2004 10:57:23 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <52654gcgrp.fsf@topspin.com> References: <52654gcgrp.fsf@topspin.com> Message-ID: <1099940243.2274.8.camel@duffman> On Mon, 2004-11-08 at 10:55 -0800, Roland Dreier wrote: > Convert mad.c and agent.c to use ib_get_dma_mr() instead of > ib_reg_phys_mr(). This is simpler and is actually required on > platforms such as sparc64 where DMA addresses may not match up with > physical RAM addresses. > > OK to commit? Yes Yes please. -tduffy -- Tom Duffy From krkumar at us.ibm.com Mon Nov 8 10:50:44 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 10:50:44 -0800 (PST) Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: <527jozhdv0.fsf@topspin.com> Message-ID: On Fri, 5 Nov 2004, Roland Dreier wrote: > Thanks but I'm not going to apply this. I prefer to have the locking > and the idr lookup be explicit (and it's only done in two places so > the cleanup is pretty minimal). Actually three places ... And IMO, it does make the locking code look cleaner, eg, the original code (with multiple unlocks) : void ib_sa_cancel_query(int id, struct ib_sa_query *query) { unsigned long flags; spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, query->id) != query) { spin_unlock_irqrestore(&idr_lock, flags); return; } spin_unlock_irqrestore(&idr_lock, flags); ib_cancel_mad(query->port->agent, query->id); } now becomes : void ib_sa_cancel_query(int id, struct ib_sa_query *query) { if (ib_sa_find_idr(id) == query) ib_cancel_mad(query->port->agent, query->id); } with the find: static inline struct ib_sa_query *ib_sa_find_idr(int id) { struct ib_sa_query *query; unsigned long flags; spin_lock_irqsave(&idr_lock, flags); query = idr_find(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); return query; } thx, - KK From mshefty at ichips.intel.com Mon Nov 8 11:01:42 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 11:01:42 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <52654gcgrp.fsf@topspin.com> References: <52654gcgrp.fsf@topspin.com> Message-ID: <418FC296.7070602@ichips.intel.com> Roland Dreier wrote: > Convert mad.c and agent.c to use ib_get_dma_mr() instead of > ib_reg_phys_mr(). This is simpler and is actually required on > platforms such as sparc64 where DMA addresses may not match up with > physical RAM addresses. > > OK to commit? Looks good to me. - Sean From mshefty at ichips.intel.com Mon Nov 8 11:07:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 11:07:54 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: References: Message-ID: <418FC40A.9080403@ichips.intel.com> Krishna Kumar wrote: > Actually three places ... 
And IMO, it does make the locking code look > cleaner, eg, the original code (with multiple unlocks) : > > void ib_sa_cancel_query(int id, struct ib_sa_query *query) > { > unsigned long flags; > > spin_lock_irqsave(&idr_lock, flags); > if (idr_find(&query_idr, query->id) != query) { > spin_unlock_irqrestore(&idr_lock, flags); > return; > } > spin_unlock_irqrestore(&idr_lock, flags); > > ib_cancel_mad(query->port->agent, query->id); I admit that I haven't looked at the SA code yet, but can ib_sa_cancel_query pass straight through to ib_cancel_mad? Since the lock is not held around both the find and the cancel, it seems possible. - Sean From krkumar at us.ibm.com Mon Nov 8 11:03:18 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 11:03:18 -0800 (PST) Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: <52brebhdwh.fsf@topspin.com> Message-ID: Hi Roland, I agree with this. BTW, can't the release handler execute before the (I know, quirky race, but interrupts ...) : } else *sa_query = &query->sa_query and free up the memory ? Do you want send_mad() to return a copy of the sa_query (*copy = *query) before it actually sends on the wire ? The callers can use this on success case to return sa_query and id. - KK On Fri, 5 Nov 2004, Roland Dreier wrote: > Sorry, this and the follow-up patch are wrong. If the send > succeeds then we can't free the query structure until the query > finishes up. (The query will be freed in the appropriate ->release > method in this case). > > You are right that there is a memory leak though. I fixed it like > this: > > Index: infiniband/core/sa_query.c > =================================================================== > --- infiniband/core/sa_query.c (revision 1166) > +++ infiniband/core/sa_query.c (working copy) > @@ -500,6 +500,7 @@ > > static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) > { > + kfree(sa_query->mad); > kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); > } > > @@ -544,11 +545,12 @@ > rec, query->sa_query.mad->data); > > ret = send_mad(&query->sa_query, timeout_ms); > - if (ret) > + if (ret) { > + kfree(query->sa_query.mad); > kfree(query); > + } else > + *sa_query = &query->sa_query; > > - *sa_query = &query->sa_query; > - > return ret ? ret : query->sa_query.id; > } > EXPORT_SYMBOL(ib_sa_path_rec_get); > @@ -572,6 +574,7 @@ > > static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) > { > + kfree(sa_query->mad); > kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); > } > > @@ -617,11 +620,12 @@ > rec, query->sa_query.mad->data); > > ret = send_mad(&query->sa_query, timeout_ms); > - if (ret) > + if (ret) { > + kfree(query->sa_query.mad); > kfree(query); > + } else > + *sa_query = &query->sa_query; > > - *sa_query = &query->sa_query; > - > return ret ? 
ret : query->sa_query.id; > } > EXPORT_SYMBOL(ib_sa_mcmember_rec_query); > > From roland at topspin.com Mon Nov 8 11:12:03 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:12:03 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: <418FC40A.9080403@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Nov 2004 11:07:54 -0800") References: <418FC40A.9080403@ichips.intel.com> Message-ID: <521xf4cfzw.fsf@topspin.com> Actually looking at this code one more time: spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, query->id) != query) { spin_unlock_irqrestore(&idr_lock, flags); return; } spin_unlock_irqrestore(&idr_lock, flags); ib_cancel_mad(query->port->agent, query->id); I realize that it has a race. I check that the query is still around inside the spinlock, but the query could complete and be freed in between the unlock and the call to ib_cancel_mad(). I'll have to add some reference counting... - R. From halr at voltaire.com Mon Nov 8 11:19:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 14:19:53 -0500 Subject: [openib-general] [PATCH] mad: Eliminate line wraps in mad.c and agent.c Message-ID: <1099941592.25460.4.camel@hpc-1> mad: Eliminate line wraps (lines over 80 columns) in mad.c and agent.c Index: mad.c =================================================================== --- mad.c (revision 1168) +++ mad.c (working copy) @@ -244,7 +244,6 @@ mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; - mad_agent_priv->phys_port_cnt = port_priv->phys_port_cnt; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -400,25 +399,28 @@ GFP_ATOMIC : GFP_KERNEL); if (!mad_priv) { ret = -ENOMEM; - printk(KERN_ERR PFX "No memory for local response MAD\n"); + printk(KERN_ERR PFX "No memory for local " + "response MAD\n"); goto error1; } mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - ret = mad_agent->device->process_mad(mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ - (struct ib_mad *)smp, - (struct ib_mad *)&mad_priv->mad); + ret = mad_agent->device->process_mad( + mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? 
*/ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); if ((ret & IB_MAD_RESULT_SUCCESS) && (ret & IB_MAD_RESULT_REPLY)) { - if (!smi_handle_dr_smp_recv((struct ib_smp *)&mad_priv->mad, - mad_agent->device->node_type, - mad_agent->port_num, - mad_agent_priv->phys_port_cnt)) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)&mad_priv->mad, + mad_agent->device->node_type, + mad_agent->port_num, + mad_agent->device->phys_port_cnt)) { ret = -EINVAL; kmem_cache_free(ib_mad_cache, mad_priv); goto error1; @@ -430,7 +432,10 @@ mad_agent_priv->agent.recv_handler) { struct ib_wc wc; - /* Defined behavior is to complete response before request */ + /* + * Defined behavior is to complete response + * before request + */ wc.wr_id = send_wr->wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; @@ -443,13 +448,16 @@ wc.sl = 0; wc.dlid_path_bits = 0; mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); mad_priv->header.recv_buf.grh = NULL; mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler(mad_agent, - &mad_priv->header.recv_wc); + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler( + mad_agent, + &mad_priv->header.recv_wc); } else kmem_cache_free(ib_mad_cache, mad_priv); @@ -458,7 +466,9 @@ mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; mad_send_wc.wr_id = send_wr->wr_id; - mad_agent_priv->agent.send_handler(mad_agent, &mad_send_wc); + mad_agent_priv->agent.send_handler( + mad_agent, + &mad_send_wc); ret = 1; } else ret = -EINVAL; @@ -515,7 +525,8 @@ (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) goto error2; - mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, agent); /* Walk list of send WRs and post each on send list */ @@ -527,7 +538,8 @@ if (!cur_send_wr->wr.ud.mad_hdr) { *bad_send_wr = cur_send_wr; - printk(KERN_ERR PFX "MAD header must be supplied in WR %p\n", cur_send_wr); + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", cur_send_wr); goto error1; } @@ -609,7 +621,8 @@ struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *priv; - mad_priv_hdr = container_of(mad_recv_wc, struct ib_mad_private_header, + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); @@ -678,7 +691,8 @@ /* Allocate management method table */ *method = kmalloc(sizeof **method, GFP_ATOMIC); if (!*method) { - printk(KERN_ERR PFX "No memory for ib_mad_mgmt_method_table\n"); + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); return -ENOMEM; } /* Clear management method table */ @@ -773,7 +787,8 @@ goto error3; /* Finally, add in methods being registered */ - for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); i < IB_MGMT_MAX_METHODS; i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { @@ -806,7 +821,10 @@ struct ib_mad_mgmt_method_table *method; u8 mgmt_class; - /* Was MAD registration request supplied with original registration ? */ + /* + * Was MAD registration request supplied + * with original registration ? 
+ */ if (!agent_priv->reg_req) { goto out; } @@ -1085,7 +1103,7 @@ if (!smi_handle_dr_smp_recv(smp, port_priv->device->node_type, port_priv->port_num, - port_priv->phys_port_cnt)) + port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) goto out; @@ -1108,7 +1126,10 @@ response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); if (!response) { printk(KERN_ERR PFX "No memory for response MAD\n"); - /* Is it better to assume that it wouldn't be processed ? */ + /* + * Is it better to assume that + * it wouldn't be processed ? + */ goto out; } @@ -1122,16 +1143,17 @@ if (response->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, - port_priv->device->node_type, - port_priv->port_num, - port_priv->phys_port_cnt)) { + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) { kfree(response); goto out; } } /* Send response */ - grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); + grh = (void *)recv->header.recv_buf.mad - + sizeof(struct ib_grh); if (agent_send(response, grh, wc, port_priv->device, port_priv->port_num)) { @@ -1175,7 +1197,8 @@ struct ib_mad_send_wr_private, agent_list); - if (time_after(mad_agent_priv->timeout, mad_send_wr->timeout)) { + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { mad_agent_priv->timeout = mad_send_wr->timeout; cancel_delayed_work(&mad_agent_priv->work); delay = mad_send_wr->timeout - jiffies; @@ -1204,7 +1227,8 @@ temp_mad_send_wr = list_entry(list_item, struct ib_mad_send_wr_private, agent_list); - if (time_after(mad_send_wr->timeout, temp_mad_send_wr->timeout)) + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) break; } list_add(&mad_send_wr->agent_list, list_item); @@ -1517,7 +1541,8 @@ PCI_DMA_FROMDEVICE); kmem_cache_free(ib_mad_cache, mad_priv); - printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx failed ret = %d\n", + printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx " + "failed ret = %d\n", (unsigned long long) recv_wr.wr_id, ret); return -EINVAL; } @@ -1607,7 +1632,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1628,7 +1654,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " + "ret = %d\n", ret); return ret; } @@ -1643,7 +1670,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1654,7 +1682,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " + "ret = %d\n", ret); return ret; } @@ -1669,7 +1698,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1681,7 +1711,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " + "ret = %d\n", ret); return ret; } @@ -1696,7 +1727,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - 
printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1707,7 +1739,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " + "ret = %d\n", ret); return ret; } @@ -1743,14 +1776,16 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { - printk(KERN_ERR PFX "Failed to request completion notification\n"); + printk(KERN_ERR PFX "Failed to request completion " + "notification\n"); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive requests\n"); + printk(KERN_ERR PFX "Couldn't post receive " + "requests\n"); goto error; } } @@ -1777,11 +1812,13 @@ int i, ret; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp_info[i].qp); + ret = ib_mad_change_qp_state_to_reset( + port_priv->qp_info[i].qp); if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change " - "%s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, i); + printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" + " %s port %d QP%d state to RESET\n", + port_priv->device->name, port_priv->port_num, + i); } ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); @@ -1842,8 +1879,7 @@ * Create the QP, PD, MR, and CQ if needed */ static int ib_mad_port_open(struct ib_device *device, - int port_num, - int num_ports) + int port_num) { int ret, cq_size; u64 iova = 0; @@ -1872,7 +1908,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; - port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1985,31 +2020,25 @@ static void ib_mad_init_device(struct ib_device *device) { int ret, num_ports, cur_port, i, ret2; - struct ib_device_attr device_attr; - ret = ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port, num_ports); + ret = ib_mad_port_open(device, cur_port); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); goto error_device_open; } - ret = ib_agent_port_open(device, cur_port, num_ports); + ret = ib_agent_port_open(device, cur_port); if (ret) { - printk(KERN_ERR PFX "Couldn't open %s port %d for agents\n", + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", device->name, cur_port); goto error_device_open; } @@ -2022,7 +2051,8 @@ cur_port--; ret2 = ib_agent_port_close(device, cur_port); if (ret2) { - printk(KERN_ERR PFX "Couldn't close %s port %d for agent\n", + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", device->name, cur_port); } ret2 = ib_mad_port_close(device, cur_port); @@ -2039,26 +2069,20 @@ static void ib_mad_remove_device(struct ib_device *device) { - int ret, i, num_ports, cur_port, ret2; - struct ib_device_attr device_attr; + int ret = 0, i, num_ports, cur_port, ret2; - ret = 
ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { ret2 = ib_agent_port_close(device, cur_port); if (ret2) { - printk(KERN_ERR PFX "Couldn't close %s port %d for agent\n", + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", device->name, cur_port); if (!ret) ret = ret2; @@ -2071,9 +2095,6 @@ ret = ret2; } } - -error_device_query: - return; } static struct ib_client mad_client = { Index: agent.c =================================================================== --- agent.c (revision 1168) +++ agent.c (working copy) @@ -84,7 +84,8 @@ return 1; port_priv = ib_get_agent_mad(device, port_num, NULL); if (!port_priv) { - printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n", + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", device->name, port_num); return 1; } @@ -109,7 +110,8 @@ /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { - printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", + printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent " + "%p\n", mad_agent); goto out; } @@ -143,12 +145,16 @@ if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; - ah_attr.grh.sgid_index = 0; /* Should sgid be looked up -? */ + /* Should sgid be looked up ? */ + ah_attr.grh.sgid_index = 0; ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpup(&grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpup(&grh->version_tclass_flow) >> 20) & 0xff; - memcpy(ah_attr.grh.dgid.raw, grh->sgid.raw, sizeof(struct ib_grh)); + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(struct ib_grh)); } else { ah_attr.ah_flags = 0; /* No GRH for SM class */ } @@ -243,8 +249,8 @@ /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { - printk(KERN_ERR SPFX "agent_send_handler: no matching MAD agent " - "%p\n", mad_agent); + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); return; } @@ -252,8 +258,9 @@ spin_lock_irqsave(&port_priv->send_list_lock, flags); if (list_empty(&port_priv->send_posted_list)) { spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - printk(KERN_ERR SPFX "Send completion WR ID 0x%Lx but send list " - "is empty\n", (unsigned long long) mad_send_wc->wr_id); + printk(KERN_ERR SPFX "Send completion WR ID 0x%Lx but send " + "list is empty\n", + (unsigned long long) mad_send_wc->wr_id); return; } From roland at topspin.com Mon Nov 8 11:20:10 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:20:10 -0800 Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: (Krishna Kumar's message of "Mon, 8 Nov 2004 11:03:18 -0800 (PST)") References: Message-ID: <52wtwwb11x.fsf@topspin.com> Krishna> Hi Roland, I agree with this. BTW, can't the release Krishna> handler execute before the (I know, quirky race, but Krishna> interrupts ...) 
Yeah, good point (although the consumer can't rely on the value until the function has returned, the consumer's callback might overwrite it). - R. From roland at topspin.com Mon Nov 8 11:21:35 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:21:35 -0800 Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: <52wtwwb11x.fsf@topspin.com> (Roland Dreier's message of "Mon, 08 Nov 2004 11:20:10 -0800") References: <52wtwwb11x.fsf@topspin.com> Message-ID: <52sm7kb0zk.fsf@topspin.com> I think this should be better: Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 1175) +++ core/sa_query.c (working copy) @@ -544,12 +544,13 @@ ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, query->sa_query.mad->data); + *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { + *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); - } else - *sa_query = &query->sa_query; + } return ret ? ret : query->sa_query.id; } @@ -619,12 +620,13 @@ ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), rec, query->sa_query.mad->data); + *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { + *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); - } else - *sa_query = &query->sa_query; + } return ret ? ret : query->sa_query.id; } From halr at voltaire.com Mon Nov 8 11:28:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 14:28:39 -0500 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <52654gcgrp.fsf@topspin.com> References: <52654gcgrp.fsf@topspin.com> Message-ID: <1099942119.25460.8.camel@hpc-1> On Mon, 2004-11-08 at 13:55, Roland Dreier wrote: > Convert mad.c and agent.c to use ib_get_dma_mr() instead of > ib_reg_phys_mr(). This is simpler and is actually required on > platforms such as sparc64 where DMA addresses may not match up with > physical RAM addresses. > > OK to commit? OK by me with same comment in two places below: > > - Roland > > Index: core/agent.c > =================================================================== > --- core/agent.c (revision 1172) > +++ core/agent.c (working copy) > @@ -281,15 +281,9 @@ > kfree(agent_send_wr); > } > > -int ib_agent_port_open(struct ib_device *device, int port_num, > - int phys_port_cnt) > +int ib_agent_port_open(struct ib_device *device, int port_num) > { > int ret; > - u64 iova = 0; > - struct ib_phys_buf buf_list = { > - .addr = 0, > - .size = (unsigned long) high_memory - PAGE_OFFSET > - }; > struct ib_agent_port_private *port_priv; > struct ib_mad_reg_req reg_req; > unsigned long flags; > @@ -312,7 +306,6 @@ > > memset(port_priv, 0, sizeof *port_priv); > port_priv->port_num = port_num; > - port_priv->phys_port_cnt = phys_port_cnt; > port_priv->wr_id = 0; > spin_lock_init(&port_priv->send_list_lock); > INIT_LIST_HEAD(&port_priv->send_posted_list); > @@ -356,9 +349,8 @@ > goto error4; > } > > - port_priv->mr = ib_reg_phys_mr(port_priv->dr_smp_agent->qp->pd, > - &buf_list, 1, > - IB_ACCESS_LOCAL_WRITE, &iova); > + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, > + IB_ACCESS_LOCAL_WRITE); > if (IS_ERR(port_priv->mr)) { > printk(KERN_ERR SPFX "Couldn't register MR\n"); Should this message be changed ? 
> ret = PTR_ERR(port_priv->mr); > Index: core/mad.c > =================================================================== > --- core/mad.c (revision 1172) > +++ core/mad.c (working copy) > @@ -1844,11 +1844,6 @@ > int port_num) > { > int ret, cq_size; > - u64 iova = 0; > - struct ib_phys_buf buf_list = { > - .addr = 0, > - .size = (unsigned long) high_memory - PAGE_OFFSET > - }; > struct ib_mad_port_private *port_priv; > unsigned long flags; > > @@ -1890,8 +1885,7 @@ > goto error4; > } > > - port_priv->mr = ib_reg_phys_mr(port_priv->pd, &buf_list, 1, > - IB_ACCESS_LOCAL_WRITE, &iova); > + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); > if (IS_ERR(port_priv->mr)) { > printk(KERN_ERR PFX "Couldn't register ib_mad MR\n"); Should this message be changed ? > ret = PTR_ERR(port_priv->mr); From krkumar at us.ibm.com Mon Nov 8 11:18:50 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 11:18:50 -0800 (PST) Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: <521xf4cfzw.fsf@topspin.com> Message-ID: Good catch Sean. Yes, but since it is a race (hence uncommon), isn't it enough to let the ib_cancel_mad handle it ? It drops out if find_send_by_wr_id fails to find this entry. - KK On Mon, 8 Nov 2004, Roland Dreier wrote: > Actually looking at this code one more time: > > spin_lock_irqsave(&idr_lock, flags); > if (idr_find(&query_idr, query->id) != query) { > spin_unlock_irqrestore(&idr_lock, flags); > return; > } > spin_unlock_irqrestore(&idr_lock, flags); > > ib_cancel_mad(query->port->agent, query->id); > > I realize that it has a race. I check that the query is still around > inside the spinlock, but the query could complete and be freed in > between the unlock and the call to ib_cancel_mad(). I'll have to add > some reference counting... > > - R. From roland at topspin.com Mon Nov 8 11:45:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:45:41 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: (Krishna Kumar's message of "Mon, 8 Nov 2004 11:18:50 -0800 (PST)") References: Message-ID: <52oei8azve.fsf@topspin.com> Krishna> Good catch Sean. Yes, but since it is a race (hence Krishna> uncommon), isn't it enough to let the ib_cancel_mad Krishna> handle it ? It drops out if find_send_by_wr_id fails to Krishna> find this entry. Actually it's my catch :) The problem is that ib_cancel_mad(query->port->agent, query->id); dereferences query, which might already be gone. I think I have a clean way to fix it though. - R. From roland at topspin.com Mon Nov 8 11:46:52 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:46:52 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <1099942119.25460.8.camel@hpc-1> (Hal Rosenstock's message of "Mon, 08 Nov 2004 14:28:39 -0500") References: <52654gcgrp.fsf@topspin.com> <1099942119.25460.8.camel@hpc-1> Message-ID: <52k6swaztf.fsf@topspin.com> OK, I committed with error messages like "Couldn't get ib_mad DMA MR" - R. 
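For reference, a minimal sketch of the registration pattern before and after this conversion, using only calls quoted in the patches above (ib_reg_phys_mr, ib_get_dma_mr, IS_ERR/PTR_ERR, ib_dereg_mr); the helper name setup_dma_mr and its error handling are illustrative, not from the actual tree:

	/*
	 * Hypothetical helper showing the new registration style: instead of
	 * describing all of physical RAM with an ib_phys_buf and calling
	 * ib_reg_phys_mr(), ask the HCA driver for an MR covering the whole
	 * DMA address space.  This is the part that keeps sparc64 working,
	 * where DMA addresses need not equal physical RAM addresses.
	 */
	static int setup_dma_mr(struct ib_pd *pd, struct ib_mr **mr)
	{
		*mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);
		if (IS_ERR(*mr))
			return PTR_ERR(*mr);
		return 0;
	}

Teardown is unchanged: the port close path still calls ib_dereg_mr() on the MR, as the quoted error-unwinding code in sa_query.c does.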
From halr at voltaire.com Mon Nov 8 12:29:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 15:29:04 -0500 Subject: [openib-general] [PATCH] Eliminate no longer used phys_port_cnt member in mad and agent structures Message-ID: <1099945743.8714.1.camel@hpc-1> Eliminate no longer used phys_port_cnt member in mad and agent structures Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1177) +++ mad_priv.h (working copy) @@ -115,7 +115,6 @@ atomic_t refcount; wait_queue_head_t wait; - int phys_port_cnt; u8 rmpp_version; }; @@ -157,7 +156,6 @@ struct list_head port_list; struct ib_device *device; int port_num; - int phys_port_cnt; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1177) +++ agent_priv.h (working copy) @@ -42,7 +42,6 @@ struct list_head send_posted_list; spinlock_t send_list_lock; int port_num; - int phys_port_cnt; struct ib_mad_agent *dr_smp_agent; /* DR SM class */ struct ib_mad_agent *lr_smp_agent; /* LR SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ From halr at voltaire.com Mon Nov 8 13:13:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 16:13:57 -0500 Subject: [Fwd: Re: [openib-general] IPoIB Multicast] Message-ID: <1099948436.8714.23.camel@hpc-1> On the first issue, I now understand why there are 2 joins for the broadcast group. ipoib_set_mcast_list causes the ipoib_restart_task to be run, which stops and starts the multicast thread. Doing this causes the first join to the broadcast group to be cancelled (flushed), and even though it appears to work on the IB wire, it is not completed in the host. The second join for the broadcast group is completed without being cancelled. Not sure what (if anything) should be done about this. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [openib-general] IPoIB Multicast Date: 06 Nov 2004 14:48:38 -0500 On Sat, 2004-11-06 at 14:27, Roland Dreier wrote: > Hal> 1. If you down the interface and bring it back up, the second > Hal> time up, there are 2 identical join requests for the > Hal> broadcast group rather than just 1. These 2 come out very > Hal> close to one another (217 usec apart). Is there some counting > Hal> issue that is causing this ? > > Hal> 2. When leaving an IP multicast group, there appears to be an > Hal> extra join to 0x16 (something like 224.0.0.22 which would be > Hal> for IGMP). Any ideas on this ? > > If you or someone else doesn't debug these issues first, I'll take a > look at the code. I'll take a first crack and look at the code to see what I can determine. On the second issue, I partially understand what is going on: IPmc group changes need to be reported via IGMP so the IPmc router knows to prune the multicast tree, but... first, I don't understand why it joins here (and not earlier, when an IPmc group is first joined by this node); and second, after the join is successful, I do not see any IGMP packet come out of the node (onto IB; maybe it is going out the ethernet instead). 
-- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Mon Nov 8 15:48:41 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 15:48:41 -0800 Subject: [openib-general] ib_mad_recv_done_handler questions Message-ID: <419005D9.8070305@ichips.intel.com> Looking at the latest changes to ib_mad_recv_done_handler, I have a couple of questions: * If the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Can we either push this allocation down into the HCA driver, or find an alternative way of interacting between the two drivers that doesn't require this allocation unless a response will be generated? * If process_mad consumes the MAD, should the code just goto out? Something more like: ret = port_priv->device->process_mad(...) if ((ret & IB_MAD_RESULT_SUCCESS) && (ret & IB_MAD_RESULT_REPLY)) { ... } else becomes ret = port_priv->device->process_mad(...) if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { ... } ... goto out; } else Does the MAD still need to be dispatched in this case? - Sean From mshefty at ichips.intel.com Mon Nov 8 16:27:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 16:27:23 -0800 Subject: [openib-general] MAD agent code comments Message-ID: <41900EEB.7050109@ichips.intel.com> A couple of comments (so far) while tracing through the MAD agent code. * There are a couple of places where ib_get_agent_mad() will be called multiple times in the same execution path. For example agent_send calls it, as does agent_mad_send. I didn't check to see if the calls would return the same ib_agent_port_private structure. (Would calling the function ib_get_agent_port() make more sense?) * The agent code assumes that sends are completed in the order that they are posted. The MAD code does not guarantee that this is the case. (It cannot do this as a result of matching requests with responses, handling timeouts, and error handling.) - Sean From roland at topspin.com Mon Nov 8 16:51:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 16:51:05 -0800 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <419005D9.8070305@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Nov 2004 15:48:41 -0800") References: <419005D9.8070305@ichips.intel.com> Message-ID: <521xf3c0au.fsf@topspin.com> Sean> * If the underlying driver provides a process_mad routine, a Sean> response MAD is allocated every time a MAD is received on QP Sean> 0 or 1. Can we either push this allocation down into the Sean> HCA driver, or find an alternative way of interacting Sean> between the two drivers that doesn't require this allocation Sean> unless a response will be generated? How about if the MAD layer allocates a response MAD when a MAD is received, and if the process_mad call doesn't actually generate a response the MAD layer just stashes the response MAD away to use for the next receive? This should keep the number of allocations within 1 of the number of responses actually generated, but save us from tracking allocations between two layers. - R. 
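A rough sketch of the stashing scheme just described, assuming a hypothetical spare_response member added to the port private structure (the field and helper names below are invented for illustration, and locking against concurrent receives is omitted); only ib_mad_cache and the kmem_cache calls come from the quoted code:

	/*
	 * Keep at most one preallocated response MAD per port.  A receive
	 * takes the spare if one is stashed, otherwise allocates a fresh
	 * buffer; if process_mad() generates no reply, the buffer goes
	 * back into the stash for the next receive.  Allocations therefore
	 * stay within one of the number of replies actually generated.
	 */
	static struct ib_mad_private *
	get_response_mad(struct ib_mad_port_private *port_priv)
	{
		struct ib_mad_private *response = port_priv->spare_response;

		if (response)
			port_priv->spare_response = NULL;
		else
			response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
		return response;
	}

	static void
	stash_response_mad(struct ib_mad_port_private *port_priv,
			   struct ib_mad_private *response)
	{
		/* process_mad() did not reply; reuse the buffer next time */
		port_priv->spare_response = response;
	}

The design point is that the two layers never hand ownership of an allocation back and forth: the MAD layer both allocates and stashes, so the HCA driver's process_mad() only ever fills in a buffer it is given.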
From mshefty at ichips.intel.com Mon Nov 8 16:57:28 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 16:57:28 -0800 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <521xf3c0au.fsf@topspin.com> References: <419005D9.8070305@ichips.intel.com> <521xf3c0au.fsf@topspin.com> Message-ID: <419015F8.4030500@ichips.intel.com> Roland Dreier wrote: > Sean> * If the underlying driver provides a process_mad routine, a > Sean> response MAD is allocated every time a MAD is received on QP > Sean> 0 or 1. Can we either push this allocation down into the > Sean> HCA driver, or find an alternative way of interacting > Sean> between the two drivers that doesn't require this allocation > Sean> unless a response will be generated? > > How about if the MAD layer allocates a response MAD when a MAD is > received, and if the process_mad call doesn't actually generate a > response the MAD layer just stashed the response MAD away to use for > the next receive? This should keep the number of allocations within 1 > of the number of responses actually generated, but save us from > tracking allocations between two layers. That sounds reasonable, and I think avoiding allocations in the HCA driver is desirable given the current design. - Sean From roland at topspin.com Mon Nov 8 20:52:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 20:52:04 -0800 Subject: [openib-general] [PATCH] Convert cache.c to use RCU Message-ID: <52lldbaakr.fsf@topspin.com> Use RCU instead of seqlocks, and simplify the code. Index: core/device.c =================================================================== --- core/device.c (revision 1178) +++ core/device.c (working copy) @@ -190,8 +190,7 @@ int ib_register_device(struct ib_device *device) { - struct ib_device_private *priv; - int ret; + int ret; down(&device_sem); @@ -206,51 +205,16 @@ goto out; } - priv = kmalloc(sizeof *priv, GFP_KERNEL); - if (!priv) { - printk(KERN_WARNING "Couldn't allocate private struct for %s\n", - device->name); - ret = -ENOMEM; - goto out; - } - - *priv = (struct ib_device_private) { 0 }; - - if (device->node_type == IB_NODE_SWITCH) { - priv->start_port = priv->end_port = 0; - } else { - priv->start_port = 1; - priv->end_port = device->phys_port_cnt; - } - - priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), - GFP_KERNEL); - if (!priv->port_data) { - printk(KERN_WARNING "Couldn't allocate port info for %s\n", - device->name); - ret = -ENOMEM; - goto out_free; - } - - device->core = priv; - INIT_LIST_HEAD(&device->event_handler_list); INIT_LIST_HEAD(&device->client_data_list); spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); - ret = ib_cache_setup(device); - if (ret) { - printk(KERN_WARNING "Couldn't create device info cache for %s\n", - device->name); - goto out_free_port; - } - ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); - goto out_free_cache; + goto out; } list_add_tail(&device->core_list, &device_list); @@ -265,18 +229,6 @@ client->add(device); } - up(&device_sem); - return 0; - - out_free_cache: - ib_cache_cleanup(device); - - out_free_port: - kfree(priv->port_data); - - out_free: - kfree(priv); - out: up(&device_sem); return ret; @@ -285,7 +237,6 @@ void ib_unregister_device(struct ib_device *device) { - struct ib_device_private *priv = device->core; struct ib_client *client; struct ib_client_data *context, *tmp; unsigned long 
flags; @@ -305,11 +256,6 @@ kfree(context); spin_unlock_irqrestore(&device->client_data_lock, flags); - ib_cache_cleanup(device); - - kfree(priv->port_data); - kfree(priv); - device->reg_state = IB_DEV_UNREGISTERED; } EXPORT_SYMBOL(ib_unregister_device); @@ -490,11 +436,18 @@ if (ret) printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + return ret; } static void __exit ib_core_cleanup(void) { + ib_cache_cleanup(); ib_sysfs_cleanup(); } Index: core/cache.c =================================================================== --- core/cache.c (revision 1178) +++ core/cache.c (working copy) @@ -23,57 +23,77 @@ #include #include - #include #include +#include #include "core_priv.h" -int ib_cached_lid_get(struct ib_device *device, - u8 port, - struct ib_port_lid *port_lid) -{ - struct ib_device_private *priv; - unsigned int seq; +struct ib_pkey_cache { + struct rcu_head rcu; + int table_len; + u16 table[0]; +}; - priv = device->core; +struct ib_gid_cache { + struct rcu_head rcu; + int table_len; + union ib_gid table[0]; +}; - if (port < priv->start_port || port > priv->end_port) - return -EINVAL; +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - memcpy(port_lid, - &priv->port_data[port].port_lid, - sizeof (struct ib_port_lid)); - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} - return 0; +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; } -EXPORT_SYMBOL(ib_cached_lid_get); +static void rcu_free_pkey(struct rcu_head *head) +{ + struct ib_pkey_cache *cache = + container_of(head, struct ib_pkey_cache, rcu); + kfree(cache); +} + +static void rcu_free_gid(struct rcu_head *head) +{ + struct ib_gid_cache *cache = + container_of(head, struct ib_gid_cache, rcu); + kfree(cache); +} + int ib_cached_gid_get(struct ib_device *device, u8 port, int index, union ib_gid *gid) { - struct ib_device_private *priv; - unsigned int seq; + struct ib_gid_cache *cache; + int ret = 0; - priv = device->core; - - if (port < priv->start_port || port > priv->end_port) + if (port < start_port(device) || port > end_port(device)) return -EINVAL; - if (index < 0 || index >= priv->port_data[port].properties.gid_tbl_len) - return -EINVAL; + rcu_read_lock(); - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - *gid = priv->port_data[port].gid_table[index]; - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); + cache = rcu_dereference(device->cache.gid_cache[port - start_port(device)]); - return 0; + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + rcu_read_unlock(); + + return ret; } EXPORT_SYMBOL(ib_cached_gid_get); @@ -82,23 +102,24 @@ int index, u16 *pkey) { - struct ib_device_private *priv; - unsigned int seq; + struct ib_pkey_cache *cache; + int ret = 0; - priv = device->core; - - if (port < priv->start_port || port > priv->end_port) + if (port < start_port(device) || port > end_port(device)) return -EINVAL; - if (index < 0 || index >= priv->port_data[port].properties.pkey_tbl_len) - return -EINVAL; + rcu_read_lock(); - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - *pkey = priv->port_data[port].pkey_table[index]; - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); - return 0; + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + rcu_read_unlock(); + + return ret; } EXPORT_SYMBOL(ib_cached_pkey_get); @@ -107,207 +128,214 @@ u16 pkey, u16 *index) { - struct ib_device_private *priv; - unsigned int seq; - int i; - int found; + struct ib_pkey_cache *cache; + int i; + int ret = -ENOENT; - priv = device->core; - - if (port < priv->start_port || port > priv->end_port) + if (port < start_port(device) || port > end_port(device)) return -EINVAL; - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - found = -1; - for (i = 0; i < priv->port_data[port].properties.pkey_tbl_len; ++i) { - if ((priv->port_data[port].pkey_table[i] & 0x7fff) == - (pkey & 0x7fff)) { - found = i; - break; - } + rcu_read_lock(); + + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; } - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); - if (found < 0) { - return -ENOENT; - } else { - *index = found; - return 0; - } + rcu_read_unlock(); + return ret; } EXPORT_SYMBOL(ib_cached_pkey_find); static void ib_cache_update(struct ib_device *device, u8 port) { - struct ib_device_private *priv = device->core; - struct ib_port_data *info = &priv->port_data[port]; struct ib_port_attr *tprops = NULL; - union ib_gid *tgid = NULL; - u16 *tpkey = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache 
*gid_cache = NULL, *old_gid_cache; int i; int ret; tprops = kmalloc(sizeof *tprops, GFP_KERNEL); if (!tprops) - goto out; + return; - ret = device->query_port(device, port, tprops); + ret = ib_query_port(device, port, tprops); if (ret) { - printk(KERN_WARNING "query_port failed (%d) for %s\n", + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", ret, device->name); - goto out; + goto err; } - tprops->gid_tbl_len = min(tprops->gid_tbl_len, - info->gid_table_alloc_length); - tgid = kmalloc(tprops->gid_tbl_len * sizeof *tgid, GFP_KERNEL); - if (!tgid) - goto out; + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; - for (i = 0; i < tprops->gid_tbl_len; ++i) { - ret = device->query_gid(device, port, i, tgid + i); + INIT_RCU_HEAD(&pkey_cache->rcu); + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + INIT_RCU_HEAD(&gid_cache->rcu); + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); if (ret) { - printk(KERN_WARNING "query_gid failed (%d) for %s (index %d)\n", + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", ret, device->name, i); - goto out; + goto err; } } - tprops->pkey_tbl_len = min(tprops->pkey_tbl_len, - info->pkey_table_alloc_length); - tpkey = kmalloc(tprops->pkey_tbl_len * sizeof (u16), - GFP_KERNEL); - if (!tpkey) - goto out; - - for (i = 0; i < tprops->pkey_tbl_len; ++i) { - ret = device->query_pkey(device, port, i, &tpkey[i]); + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); if (ret) { - printk(KERN_WARNING "query_pkey failed (%d) " - "for %s, port %d, index %d\n", - ret, device->name, port, i); - goto out; + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; } } - write_seqcount_begin(&info->lock); + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; - info->properties = *tprops; +#warning Delete definition of rcu_assign_pointer when 2.6.10 is released! 
+#ifndef rcu_assign_pointer +#define rcu_assign_pointer(p, v) ({ \ + smp_wmb(); \ + (p) = (v); \ + }) +#endif - info->port_lid.lid = info->properties.lid; - info->port_lid.lmc = info->properties.lmc; + rcu_assign_pointer(device->cache.pkey_cache[port - start_port(device)], + pkey_cache); + rcu_assign_pointer(device->cache.gid_cache [port - start_port(device)], + gid_cache); - memcpy(info->gid_table, tgid, - tprops->gid_tbl_len * sizeof *tgid); - memcpy(info->pkey_table, tpkey, - tprops->pkey_tbl_len * sizeof *tpkey); + if (old_pkey_cache) + call_rcu(&old_pkey_cache->rcu, rcu_free_pkey); + if (old_gid_cache) + call_rcu(&old_gid_cache->rcu, rcu_free_gid); - write_seqcount_end(&info->lock); + kfree(tprops); + return; - out: +err: + kfree(pkey_cache); + kfree(gid_cache); kfree(tprops); - kfree(tpkey); - kfree(tgid); } -static void ib_cache_task(void *port_ptr) +static void ib_cache_task(void *work_ptr) { - struct ib_port_data *port_data = port_ptr; + struct ib_update_work *work = work_ptr; - ib_cache_update(port_data->device, port_data->port_num); + ib_cache_update(work->device, work->port_num); + kfree(work); } static void ib_cache_event(struct ib_event_handler *handler, struct ib_event *event) { + struct ib_update_work *work; + if (event->event == IB_EVENT_PORT_ERR || event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE) { - struct ib_device_private *priv = event->device->core; - schedule_work(&priv->port_data[event->element.port_num].refresh_task); + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } } } -int ib_cache_setup(struct ib_device *device) +void ib_cache_setup_one(struct ib_device *device) { - struct ib_device_private *priv = device->core; - struct ib_port_attr prop; - int p; - int ret; + int p; - for (p = priv->start_port; p <= priv->end_port; ++p) { - priv->port_data[p].device = device; - priv->port_data[p].port_num = p; - INIT_WORK(&priv->port_data[p].refresh_task, - ib_cache_task, &priv->port_data[p]); - priv->port_data[p].gid_table = NULL; - priv->port_data[p].pkey_table = NULL; - priv->port_data[p].event_handler.device = NULL; + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.gid_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; } - for (p = priv->start_port; p <= priv->end_port; ++p) { - seqcount_init(&priv->port_data[p].lock); - ret = device->query_port(device, p, &prop); - if (ret) { - printk(KERN_WARNING "query_port failed for %s\n", - device->name); - goto error; - } - priv->port_data[p].gid_table_alloc_length = prop.gid_tbl_len; - priv->port_data[p].gid_table = kmalloc(prop.gid_tbl_len * - sizeof (union ib_gid), - GFP_KERNEL); - if (!priv->port_data[p].gid_table) { - ret = -ENOMEM; - goto error; - } + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } - priv->port_data[p].pkey_table_alloc_length = prop.pkey_tbl_len; - priv->port_data[p].pkey_table = 
kmalloc(prop.pkey_tbl_len * sizeof (u16), - GFP_KERNEL); - if (!priv->port_data[p].pkey_table) { - ret = -ENOMEM; - goto error; - } + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; - ib_cache_update(device, p); + return; - INIT_IB_EVENT_HANDLER(&priv->port_data[p].event_handler, - device, ib_cache_event); - ret = ib_register_event_handler(&priv->port_data[p].event_handler); - if (ret) { - priv->port_data[p].event_handler.device = NULL; - goto error; - } +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); } - return 0; +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} - error: - for (p = priv->start_port; p <= priv->end_port; ++p) { - if (priv->port_data[p].event_handler.device) - ib_unregister_event_handler(&priv->port_data[p].event_handler); - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); } - return ret; + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); } -void ib_cache_cleanup(struct ib_device *device) +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) { - struct ib_device_private *priv = device->core; - int p; + return ib_register_client(&cache_client); +} - for (p = priv->start_port; p <= priv->end_port; ++p) { - ib_unregister_event_handler(&priv->port_data[p].event_handler); - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); - } +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); } /* Index: core/core_priv.h =================================================================== --- core/core_priv.h (revision 1178) +++ core/core_priv.h (working copy) @@ -29,39 +29,15 @@ #include -struct ib_device_private { - int start_port; - int end_port; - u64 node_guid; - struct ib_port_data *port_data; -}; - -struct ib_port_data { - struct ib_device *device; - - struct ib_event_handler event_handler; - struct work_struct refresh_task; - - seqcount_t lock; - struct ib_port_attr properties; - struct ib_port_lid port_lid; - int gid_table_alloc_length; - u16 pkey_table_alloc_length; - union ib_gid *gid_table; - u16 *pkey_table; - u8 port_num; -}; - -int ib_cache_setup(struct ib_device *device); -void ib_cache_cleanup(struct ib_device *device); -void ib_completion_thread(struct list_head *entry, void *device_ptr); -void ib_async_thread(struct list_head *entry, void *device_ptr); - int ib_device_register_sysfs(struct ib_device *device); void ib_device_unregister_sysfs(struct ib_device *device); + int ib_sysfs_setup(void); void ib_sysfs_cleanup(void); +int ib_cache_setup(void); +void ib_cache_cleanup(void); + #endif /* _CORE_PRIV_H */ /* Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 1178) +++ include/ib_verbs.h (working copy) @@ -672,6 +672,12 @@ #define IB_DEVICE_NAME_MAX 64 +struct ib_cache { + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache 
**gid_cache; +}; + struct ib_device { struct pci_dev *dma_device; @@ -684,7 +690,8 @@ struct list_head client_data_list; spinlock_t client_data_lock; - void *core; + struct ib_cache cache; + u32 flags; int (*query_device)(struct ib_device *device, Index: include/ts_ib_core.h =================================================================== --- include/ts_ib_core.h (revision 1178) +++ include/ts_ib_core.h (working copy) @@ -24,14 +24,6 @@ #ifndef _TS_IB_CORE_H #define _TS_IB_CORE_H -struct ib_port_lid { - u16 lid; - u8 lmc; -}; - -int ib_cached_lid_get(struct ib_device *device, - u8 port, - struct ib_port_lid *port_lid); int ib_cached_gid_get(struct ib_device *device, u8 port, int index, Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 1178) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -517,10 +517,12 @@ } { - struct ib_port_lid port_lid; + struct ib_port_attr attr; - ib_cached_lid_get(priv->ca, priv->port, &port_lid); - priv->local_lid = port_lid.lid; + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); } priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - From sreenivasulu at topspin.com Tue Nov 9 02:26:04 2004 From: sreenivasulu at topspin.com (Sreenivasulu Pulichintala) Date: Tue, 9 Nov 2004 15:56:04 +0530 Subject: [openib-general] VAPI_RETRY_EXC_ERR Message-ID: <4A388685F814D54CAE412B2DAB7CE91C195454@initexch.topspincom.com> Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, the application sometimes crashes, producing the following error - === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_spinandwaitcq: Assertion `sc->status == VAPI_SUCCESS' failed. Timeout alarm signaled Cleaning up all processes ...done. Killed by signal 15. === In what cases might I get this error? Is it because of RESYNC? Any help in this regard is highly appreciated. Thanks Sree -------------- next part -------------- An HTML attachment was scrubbed... URL: From sreenivasulu at topspin.com Tue Nov 9 02:49:17 2004 From: sreenivasulu at topspin.com (Sreenivasulu Pulichintala) Date: Tue, 9 Nov 2004 16:19:17 +0530 Subject: [openib-general] VAPI_RETRY_EXC_ERR Message-ID: <4A388685F814D54CAE412B2DAB7CE91C195455@initexch.topspincom.com> The corresponding IB macro is IB_COMP_RETRY_EXC_ERR. -----Original Message----- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: openib-general at openib.org Subject: [openib-general] VAPI_RETRY_EXC_ERR Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, the application sometimes crashes, producing the following error - === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_spinandwaitcq: Assertion `sc->status == VAPI_SUCCESS' failed. Timeout alarm signaled Cleaning up all processes ...done. Killed by signal 15. === In what cases might I get this error? Is it because of RESYNC? Any help in this regard is highly appreciated. Thanks Sree -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Tue Nov 9 05:57:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 08:57:56 -0500 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <419005D9.8070305@ichips.intel.com> References: <419005D9.8070305@ichips.intel.com> Message-ID: <1100008675.8714.2084.camel@hpc-1> On Mon, 2004-11-08 at 18:48, Sean Hefty wrote: > Looking at the latest changes to ib_mad_recv_done_handler, I have a > couple of questions: > * If process_mad consumes the MAD, should the code just goto out? > Something more like: > > ret = port_priv->device->process_mad(...) > if ((ret & IB_MAD_RESULT_SUCCESS) && > (ret & IB_MAD_RESULT_REPLY)) { > ... > } else > > becomes > > ret = port_priv->device->process_mad(...) > if (ret & IB_MAD_RESULT_SUCCESS)) { > if (ret & IB_MAD_RESULT_REPLY)) { > ... > } > ... > goto out; > } else Patch shortly on this. > Does the MAD still need to be dispatched in this case? I'm not sure exactly what all the reasons for !success being returned from process_mad are but my reasoning was as follows: In this error case, it is unclear whether the packet would have been consumed or not. If it would not have been consumed, it should be dispatched. It is only in the case where it would have been consumed that dispatching it causes a potential issue. If the packet is indeed dispatched to a client, wouldn't/shouldn't the client throw it away (as unexpected) ? If it is thrown away in this error case (a more conservative strategy), some retransmission strategy would kick in on one side or the other. I wasn't sure about this and chose the former strategy. -- Hal From halr at voltaire.com Tue Nov 9 06:01:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:01:37 -0500 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases Message-ID: <1100008896.8714.2105.camel@hpc-1> mad: In ib_mad_recv_done_handler, don't dispatch additional error cases Index: mad.c =================================================================== --- mad.c (revision 1180) +++ mad.c (working copy) @@ -1138,26 +1138,27 @@ wc->slid, recv->header.recv_buf.mad, response); - if ((ret & IB_MAD_RESULT_SUCCESS) && - (ret & IB_MAD_RESULT_REPLY)) { - if (response->mad_hdr.mgmt_class == - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) { + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_REPLY) { + if (response->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) { + kfree(response); + goto out; + } + } + /* Send response */ + grh = (void *)recv->header.recv_buf.mad - + sizeof(struct ib_grh); + if (agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) { kfree(response); - goto out; } - } - /* Send response */ - grh = (void *)recv->header.recv_buf.mad - - sizeof(struct ib_grh); - if (agent_send(response, grh, wc, - port_priv->device, - port_priv->port_num)) { - kfree(response); goto out; } } else From halr at voltaire.com Tue Nov 9 06:12:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:12:38 -0500 Subject: [openib-general] MAD agent code comments In-Reply-To: <41900EEB.7050109@ichips.intel.com> References: <41900EEB.7050109@ichips.intel.com> 
Message-ID: <1100009558.13933.3.camel@localhost.localdomain> On Mon, 2004-11-08 at 19:27, Sean Hefty wrote: > A couple of comments (so far) while tracing through the MAD agent code. > > * There are a couple of places where ib_get_agent_mad() will be called > multiple times in the same execution path. For example agent_send calls > it, as does agent_mad_send. Are there others like this ? > I didn't check to see if the calls would > return the same ib_agent_port_private structure. I eliminated the duplicate call. Patch shortly on this. > (Would calling the > function ib_get_agent_port() make more sense?) Yes. > * The agent code assumes that sends are completed in the order that they > are posted. The MAD code does not guarantee that this is the case. (It > cannot do this as a result of matching requests with response, handling > timeouts, and error handling.) Since the agent does not use solicited sends, are its sends completed in order (so this is only an issue for clients using solicited sends) ? -- Hal From halr at voltaire.com Tue Nov 9 06:27:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:27:59 -0500 Subject: [openib-general] [PATCH] agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send Message-ID: <1100010478.26166.2.camel@hpc-1> agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send (pointed out by Sean Hefty) Index: agent.c =================================================================== --- agent.c (revision 1180) +++ agent.c (working copy) @@ -35,8 +35,8 @@ static inline struct ib_agent_port_private * -__ib_get_agent_mad(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) { struct ib_agent_port_private *entry; @@ -61,14 +61,14 @@ } static inline struct ib_agent_port_private * -ib_get_agent_mad(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) { struct ib_agent_port_private *entry; unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - entry = __ib_get_agent_mad(device, port_num, mad_agent); + entry = __ib_get_agent_port(device, port_num, mad_agent); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); return entry; @@ -82,7 +82,7 @@ if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) return 1; - port_priv = ib_get_agent_mad(device, port_num, NULL); + port_priv = ib_get_agent_port(device, port_num, NULL); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " "not open\n", @@ -94,11 +94,11 @@ } static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, struct ib_mad *mad, struct ib_grh *grh, struct ib_wc *wc) { - struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; struct ib_sge gather_list; struct ib_send_wr send_wr; @@ -107,15 +107,6 @@ unsigned long flags; int ret = 1; - /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent " - "%p\n", - mad_agent); - goto out; - } - agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); if (!agent_send_wr) goto out; @@ -213,7 +204,7 @@ struct ib_agent_port_private *port_priv; struct ib_mad_agent *mad_agent; - port_priv = ib_get_agent_mad(device, 
port_num, NULL); + port_priv = ib_get_agent_port(device, port_num, NULL); if (!port_priv) { printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", device->name, port_num); @@ -235,7 +226,7 @@ return 1; } - return agent_mad_send(mad_agent, mad, grh, wc); + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent, @@ -247,7 +238,7 @@ unsigned long flags; /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); + port_priv = ib_get_agent_port(NULL, 0, mad_agent); if (!port_priv) { printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " "agent %p\n", mad_agent); @@ -296,7 +287,7 @@ unsigned long flags; /* First, check if port already open for SMI */ - port_priv = ib_get_agent_mad(device, port_num, NULL); + port_priv = ib_get_agent_port(device, port_num, NULL); if (port_priv) { printk(KERN_DEBUG SPFX "%s port %d already open\n", device->name, port_num); @@ -388,7 +379,7 @@ unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - port_priv = __ib_get_agent_mad(device, port_num, NULL); + port_priv = __ib_get_agent_port(device, port_num, NULL); if (port_priv == NULL) { spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); printk(KERN_ERR SPFX "Port %d not found\n", port_num); From halr at voltaire.com Tue Nov 9 06:21:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:21:25 -0500 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <521xf3c0au.fsf@topspin.com> References: <419005D9.8070305@ichips.intel.com> <521xf3c0au.fsf@topspin.com> Message-ID: <1100010085.13933.5.camel@localhost.localdomain> On Mon, 2004-11-08 at 19:51, Roland Dreier wrote: > Sean> * If the underlying driver provides a process_mad routine, a > Sean> response MAD is allocated every time a MAD is received on QP > Sean> 0 or 1. Can we either push this allocation down into the > Sean> HCA driver, or find an alternative way of interacting > Sean> between the two drivers that doesn't require this allocation > Sean> unless a response will be generated? > > How about if the MAD layer allocates a response MAD when a MAD is > received, and if the process_mad call doesn't actually generate a > response the MAD layer just stashed the response MAD away to use for > the next receive? This should keep the number of allocations within 1 > of the number of responses actually generated, but save us from > tracking allocations between two layers. I like it. I'll work up a patch for this. -- Hal From halr at voltaire.com Tue Nov 9 06:34:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:34:28 -0500 Subject: [openib-general] IPoIB Completion Handling Message-ID: <1100010867.26166.7.camel@hpc-1> Hi Roland, In ipoib_ib_handle_wc when status != success, isn't the WC opcode invalid ? Also, in that case, don't receives also need to be reposted ? 
-- Hal From tziporet at mellanox.co.il Tue Nov 9 06:43:01 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 9 Nov 2004 16:43:01 +0200 Subject: [openib-general] VAPI_RETRY_EXC_ERR Message-ID: <506C3D7B14CDD411A52C00025558DED6064BE9E6@mtlex01.yok.mtl.com> There can be several problems: - The retry count is too small; try the maximum value, 7. - Maybe the timeout is too small, so the HCA starts retrying too soon; try enlarging it to 21. - The PSNs of the two sides are not synchronized. - The link failed. - The QP on the other side was closed or moved to the error state. If this error occurs at the beginning of the application, it can indicate that the QP configuration is wrong. Tziporet -----Original Message----- From: Sreenivasulu Pulichintala [mailto:sreenivasulu at topspin.com] Sent: Tuesday, November 09, 2004 12:49 PM To: openib-general at openib.org Subject: RE: [openib-general] VAPI_RETRY_EXC_ERR The corresponding IB macro is IB_COMP_RETRY_EXC_ERR. -----Original Message----- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: openib-general at openib.org Subject: [openib-general] VAPI_RETRY_EXC_ERR Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, the application sometimes crashes, producing the following error - === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_spinandwaitcq: Assertion `sc->status == VAPI_SUCCESS' failed. Timeout alarm signaled Cleaning up all processes ...done. Killed by signal 15. === In what cases might I get this error? Is it because of RESYNC? Any help in this regard is highly appreciated. Thanks Sree -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Tue Nov 9 07:09:53 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 07:09:53 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <1100010867.26166.7.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 09:34:28 -0500") References: <1100010867.26166.7.camel@hpc-1> Message-ID: <52hdnz9hz2.fsf@topspin.com> Hal> In ipoib_ib_handle_wc when status != success, isn't the WC Hal> opcode invalid ? Also, in that case, don't receives also need Hal> to be reposted ? Yes, the error handling in IPoIB needs to be fixed. - R. From roland at topspin.com Tue Nov 9 07:37:50 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 07:37:50 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <52hdnz9hz2.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 07:09:53 -0800") References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> Message-ID: <524qjz9goh.fsf@topspin.com> Hal> In ipoib_ib_handle_wc when status != success, isn't the WC Hal> opcode invalid ? Also, in that case, don't receives also need Hal> to be reposted ? Roland> Yes, the error handling in IPoIB needs to be fixed. By the way, reposting the receives is not the right thing to do on error -- the QP will be in the error state, so any new work requests will just complete with a flush status. We need to reset the QP and start over to recover from errors. - R.
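To make the recovery sequence Roland describes concrete: a QP in the error state must be cycled back through reset before it will accept useful work again, and receives can only be reposted once it has left reset. Below is a minimal sketch for a UD QP, assuming the gen2 ib_modify_qp() interface and the standard UD attribute masks; it is illustrative only, not the eventual IPoIB/MAD fix.

static int ud_qp_restart(struct ib_qp *qp, u8 port, u16 pkey_index, u32 qkey)
{
	struct ib_qp_attr attr;
	int ret;

	memset(&attr, 0, sizeof attr);

	/* Error -> Reset; any outstanding WRs have already flushed. */
	attr.qp_state = IB_QPS_RESET;
	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret)
		return ret;

	/* Reset -> Init; receives may be reposted from this state on. */
	attr.qp_state   = IB_QPS_INIT;
	attr.pkey_index = pkey_index;
	attr.port_num   = port;
	attr.qkey       = qkey;
	ret = ib_modify_qp(qp, &attr,
			   IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_QKEY);
	if (ret)
		return ret;

	/* Init -> RTR -> RTS; for UD, only the state and the send PSN matter. */
	attr.qp_state = IB_QPS_RTR;
	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret)
		return ret;

	attr.qp_state = IB_QPS_RTS;
	attr.sq_psn   = 0;
	return ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_SQ_PSN);
}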
From halr at voltaire.com Tue Nov 9 08:05:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 11:05:46 -0500 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <524qjz9goh.fsf@topspin.com> References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> <524qjz9goh.fsf@topspin.com> Message-ID: <1100016345.13933.230.camel@localhost.localdomain> On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: > By the way, reposting the receives is not the right thing to do on > error -- the QP will be in the error state, so any new work requests > will just complete with a flush status. We need to reset the QP and > start over to recover from errors. Is the same thing true for QP0/1 ? If so, this needs to be done there as well. (There used to be a port restart there but this was excised). -- Hal From roland at topspin.com Tue Nov 9 08:37:05 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 08:37:05 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <1100016345.13933.230.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 11:05:46 -0500") References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> <524qjz9goh.fsf@topspin.com> <1100016345.13933.230.camel@localhost.localdomain> Message-ID: <52zn1r7zda.fsf@topspin.com> Roland> By the way, reposting the receives is not the right thing Roland> to do on error -- the QP will be in the error state, so Roland> any new work requests will just complete with a flush Roland> status. We need to reset the QP and start over to recover Roland> from errors. Hal> Is the same thing true for QP0/1 ? If so, this needs to be Hal> done there as well. (There used to be a port restart there Hal> but this was excised). Yes, of course (QP0/1 act just like normal UD QPs as far as work request processing/error semantics go). - R. From halr at voltaire.com Tue Nov 9 08:49:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 11:49:21 -0500 Subject: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy Message-ID: <1100018960.7222.6.camel@hpc-1> mad/agent: Modify receive buffer allocation strategy (Inefficiency pointed out by Sean; algorithm described by Roland) Problem: Currently, if the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Solution: The MAD layer can allocate a response MAD when a MAD is received, and if the process_mad call doesn't actually generate a response the MAD layer just stashes the response MAD away to use for the next receive. This should keep the number of allocations within 1 of the number of responses actually generated, but save us from tracking allocations between two layers. 
Index: agent.h =================================================================== --- agent.h (revision 1180) +++ agent.h (working copy) @@ -31,7 +31,7 @@ extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad *mad, +extern int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1180) +++ agent_priv.h (working copy) @@ -33,7 +33,7 @@ struct ib_agent_send_wr { struct list_head send_list; struct ib_ah *ah; - struct ib_mad *mad; + struct ib_mad_private *mad; DECLARE_PCI_UNMAP_ADDR(mapping) }; Index: agent.c =================================================================== --- agent.c (revision 1182) +++ agent.c (working copy) @@ -33,7 +33,9 @@ static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); +extern kmem_cache_t *ib_mad_cache; + static inline struct ib_agent_port_private * __ib_get_agent_port(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -95,7 +97,7 @@ static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent_port_private *port_priv, - struct ib_mad *mad, + struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc) { @@ -114,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, - mad, - sizeof(struct ib_mad), + &mad->grh, + sizeof *mad - sizeof mad->header, PCI_DMA_TODEVICE); - gather_list.length = sizeof(struct ib_mad); + gather_list.length = sizeof *mad - sizeof mad->header; gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -133,7 +135,7 @@ ah_attr.src_path_bits = wc->dlid_path_bits; ah_attr.sl = wc->sl; ah_attr.static_rate = 0; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; /* Should sgid be looked up ? 
*/ @@ -162,14 +164,14 @@ } send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } - send_wr.wr.ud.mad_hdr = (struct ib_mad_hdr *)mad; + send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; send_wr.wr_id = ++port_priv->wr_id; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -180,7 +182,8 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -195,7 +198,7 @@ return ret; } -int agent_send(struct ib_mad *mad, +int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, @@ -212,7 +215,7 @@ } /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad_hdr.mgmt_class) { + switch (mad->mad.mad.mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: mad_agent = port_priv->dr_smp_agent; break; @@ -269,13 +272,14 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); /* Release allocated memory */ - kfree(agent_send_wr->mad); + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); kfree(agent_send_wr); } Index: mad.c =================================================================== --- mad.c (revision 1181) +++ mad.c (working copy) @@ -69,7 +69,7 @@ MODULE_AUTHOR("Sean Hefty"); -static kmem_cache_t *ib_mad_cache; +kmem_cache_t *ib_mad_cache; static struct list_head ib_mad_port_list; static u32 ib_mad_client_id = 0; @@ -83,7 +83,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, @@ -1067,12 +1068,17 @@ { struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_private *recv; + struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; struct ib_smp *smp; int solicited; + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; dequeue_mad(mad_list); @@ -1119,11 +1125,9 @@ /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { - struct ib_mad *response; struct ib_grh *grh; int ret; - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); if (!response) { printk(KERN_ERR PFX "No memory for response MAD\n"); /* @@ -1137,32 
+1141,29 @@ port_priv->port_num, wc->slid, recv->header.recv_buf.mad, - response); + &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad_hdr.mgmt_class == + if (response->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, + (struct ib_smp *)&response->mad.mad, port_priv->device->node_type, port_priv->port_num, port_priv->device->phys_port_cnt)) { - kfree(response); goto out; } } /* Send response */ grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); - if (agent_send(response, grh, wc, - port_priv->device, - port_priv->port_num)) { - kfree(response); - } + if (!agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; goto out; } - } else - kfree(response); + } } /* Determine corresponding MAD agent for incoming receive MAD */ @@ -1183,7 +1184,7 @@ kmem_cache_free(ib_mad_cache, recv); /* Post another receive request for this QP */ - ib_mad_post_receive_mad(qp_info); + ib_mad_post_receive_mad(qp_info, response); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1491,7 +1492,8 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; @@ -1499,19 +1501,23 @@ struct ib_recv_wr *bad_recv_wr; int ret; - /* - * Allocate memory for receive buffer. - * This is for both MAD and private header - * which contains the receive tracking structure. - * By prepending this header, there is one rather - * than two memory allocations. - */ - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - printk(KERN_ERR PFX "No memory for receive buffer\n"); - return -ENOMEM; + if (mad) + mad_priv = mad; + else { + /* + * Allocate memory for receive buffer. + * This is for both MAD and private header + * which contains the receive tracking structure. + * By prepending this header, there is one rather + * than two memory allocations. + */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + return -ENOMEM; + } } /* Setup scatter list */ @@ -1559,7 +1565,7 @@ int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - ret = ib_mad_post_receive_mad(qp_info); + ret = ib_mad_post_receive_mad(qp_info, NULL); if (ret) { printk(KERN_ERR PFX "receive post %d failed " "on %s port %d\n", i + 1, From mshefty at ichips.intel.com Tue Nov 9 09:00:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 09:00:54 -0800 Subject: [openib-general] MAD agent code comments In-Reply-To: <1100009558.13933.3.camel@localhost.localdomain> References: <41900EEB.7050109@ichips.intel.com> <1100009558.13933.3.camel@localhost.localdomain> Message-ID: <4190F7C6.8020509@ichips.intel.com> Hal Rosenstock wrote: > Since the agent does not use solicited sends, are its sends completed in > order (so this is only an issue for clients using solicited sends) ? I would think that solicited sends (i.e. responses) would be easier to maintain order, since those wouldn't have a timeout. But my preference would be to not defined the API this way. It makes queuing for QP overrun and error handling difficult. 
For example, a client posts 2 sends, both of which get queued. If the first send gets posted, but the second send fails when posting to the QP, then we'd need to delay reporting the second send's completion. This also makes it more difficult to go to multi-threaded completion handling, if that were shown to be beneficial. - Sean From halr at voltaire.com Tue Nov 9 09:07:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:07:56 -0500 Subject: [openib-general] More on IPoIB Multicast Message-ID: <1100020075.7342.1.camel@hpc-1> Hi Roland, If a multicast send is attempted and the node is not joined to the multicast group which is the destination of the send, a send only join (which is neutered due to lack of SM support) is assumed. Is my understanding correct ? Linux also supports multicast routing. For this case, I think a non member join is needed. I'm not sure how to detect which of the join cases to use. Also, for multicast routing, the multicast group created/removed traps would need to be subscribed to (and the SM would need to support these). Does anyone know if OpenSM does support this ? -- Hal From roland at topspin.com Tue Nov 9 09:07:06 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 09:07:06 -0800 Subject: [openib-general] Re: More on IPoIB Multicast In-Reply-To: <1100020075.7342.1.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 12:07:56 -0500") References: <1100020075.7342.1.camel@hpc-1> Message-ID: <52r7n37xz9.fsf@topspin.com> Hal> Hi Roland, If a multicast send is attempted and the node is Hal> not joined to the multicast group which is the destination of Hal> the send, a send only join (which is neutered due to lack of Hal> SM support) is assumed. Is my understanding correct ? Yes. Hal> Linux also supports multicast routing. For this case, I think Hal> a non member join is needed. I'm not sure how to detect which Hal> of the join cases to use. Hal> Also, for multicast routing, the multicast group Hal> created/removed traps would need to be subscribed to (and the Hal> SM would need to support these). Someone who understands how the kernel does multicast routing would have to guide us here. My goal is to get basic IPv4 cleaned up to the point I feel comfortable submitting upstream. However I'm very happy to have other people look at IPv6, multicast routing, multiport bonding/failover (although my feeling is that it would be better to extend the existing bonding driver rather than trying to put this in the IPoIB driver), .... - R. 
From mshefty at ichips.intel.com Tue Nov 9 09:11:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 09:11:39 -0800 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases In-Reply-To: <1100008896.8714.2105.camel@hpc-1> References: <1100008896.8714.2105.camel@hpc-1> Message-ID: <4190FA4B.5000000@ichips.intel.com> Hal Rosenstock wrote: > mad: In ib_mad_recv_done_handler, don't dispatch additional error cases > + if (ret & IB_MAD_RESULT_SUCCESS) { > + if (ret & IB_MAD_RESULT_REPLY) { > + if (response->mad_hdr.mgmt_class == > + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { > + if (!smi_handle_dr_smp_recv( > + (struct ib_smp *)response, > + port_priv->device->node_type, > + port_priv->port_num, > + port_priv->device->phys_port_cnt)) { > + kfree(response); > + goto out; > + } > + } > + /* Send response */ > + grh = (void *)recv->header.recv_buf.mad - > + sizeof(struct ib_grh); > + if (agent_send(response, grh, wc, > + port_priv->device, > + port_priv->port_num)) { > kfree(response); > } > goto out; > } goto out; I guess I was wondering if it was okay to move "goto out" to here, and always skip dispatching if process_mad returned success. I think dispatching in the failure case makes sense. > } else From mshefty at ichips.intel.com Tue Nov 9 09:16:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 09:16:33 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <1100016345.13933.230.camel@localhost.localdomain> References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> <524qjz9goh.fsf@topspin.com> <1100016345.13933.230.camel@localhost.localdomain> Message-ID: <4190FB71.1090704@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: > >>By the way, reposting the receives is not the right thing to do on >>error -- the QP will be in the error state, so any new work requests >>will just complete with a flush status. We need to reset the QP and >>start over to recover from errors. > > > Is the same thing true for QP0/1 ? If so, this needs to be done there as > well. (There used to be a port restart there but this was excised). Btw, I have plans to get to this shortly. I have the send queuing code complete (need to re-merge after the patches this morning), but I haven't been able to debug the code yet. I'm running into some issues configuring a point-to-point "fabric", with opensm running on the sourceforge stack on the other node. I have some changes to handle send queuing that are needed when recovering from QP errors as well. 
- Sean From halr at voltaire.com Tue Nov 9 09:18:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:18:39 -0500 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases In-Reply-To: <4190FA4B.5000000@ichips.intel.com> References: <1100008896.8714.2105.camel@hpc-1> <4190FA4B.5000000@ichips.intel.com> Message-ID: <1100020719.13933.332.camel@localhost.localdomain> On Tue, 2004-11-09 at 12:11, Sean Hefty wrote: > Hal Rosenstock wrote: > > > mad: In ib_mad_recv_done_handler, don't dispatch additional error cases > > + if (ret & IB_MAD_RESULT_SUCCESS) { > > + if (ret & IB_MAD_RESULT_REPLY) { > > + if (response->mad_hdr.mgmt_class == > > + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { > > + if (!smi_handle_dr_smp_recv( > > + (struct ib_smp *)response, > > + port_priv->device->node_type, > > + port_priv->port_num, > > + port_priv->device->phys_port_cnt)) { > > + kfree(response); > > + goto out; > > + } > > + } > > + /* Send response */ > > + grh = (void *)recv->header.recv_buf.mad - > > + sizeof(struct ib_grh); > > + if (agent_send(response, grh, wc, > > + port_priv->device, > > + port_priv->port_num)) { > > kfree(response); > > } > > goto out; > > } > > goto out; > > I guess I was wondering if it was okay to move "goto out" to here, and > always skip dispatching if process_mad returned success. I think > dispatching in the failure case makes sense. Yes (more than OK, it's better :-) I'll issue a patch for this shortly. -- Hal From halr at voltaire.com Tue Nov 9 09:33:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:33:02 -0500 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch in additional case Message-ID: <1100021581.10808.1.camel@hpc-1> mad: In ib_mad_recv_done_handler, don't dispatch in additional case Index: mad.c =================================================================== --- mad.c (revision 1183) +++ mad.c (working copy) @@ -1161,8 +1161,8 @@ port_priv->device, port_priv->port_num)) response = NULL; - goto out; } + goto out; } } From halr at voltaire.com Tue Nov 9 09:30:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:30:50 -0500 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] Message-ID: <1100021450.13933.339.camel@localhost.localdomain> One more thing on this I forgot to post: As I am not yet set up with Kegel cross tools (and don't have a machine where the pci_ macros are non trivial), I would appreciate it if someone could verify these changes (or latest code) on some architecture where the pci_ macros are non trivial. Thanks. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy Date: 09 Nov 2004 11:49:21 -0500 mad/agent: Modify receive buffer allocation strategy (Inefficiency pointed out by Sean; algorithm described by Roland) Problem: Currently, if the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Solution: The MAD layer can allocate a response MAD when a MAD is received, and if the process_mad call doesn't actually generate a response the MAD layer just stashes the response MAD away to use for the next receive. This should keep the number of allocations within 1 of the number of responses actually generated, but save us from tracking allocations between two layers. 
Index: agent.h =================================================================== --- agent.h (revision 1180) +++ agent.h (working copy) @@ -31,7 +31,7 @@ extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad *mad, +extern int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1180) +++ agent_priv.h (working copy) @@ -33,7 +33,7 @@ struct ib_agent_send_wr { struct list_head send_list; struct ib_ah *ah; - struct ib_mad *mad; + struct ib_mad_private *mad; DECLARE_PCI_UNMAP_ADDR(mapping) }; Index: agent.c =================================================================== --- agent.c (revision 1182) +++ agent.c (working copy) @@ -33,7 +33,9 @@ static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); +extern kmem_cache_t *ib_mad_cache; + static inline struct ib_agent_port_private * __ib_get_agent_port(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -95,7 +97,7 @@ static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent_port_private *port_priv, - struct ib_mad *mad, + struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc) { @@ -114,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, - mad, - sizeof(struct ib_mad), + &mad->grh, + sizeof *mad - sizeof mad->header, PCI_DMA_TODEVICE); - gather_list.length = sizeof(struct ib_mad); + gather_list.length = sizeof *mad - sizeof mad->header; gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -133,7 +135,7 @@ ah_attr.src_path_bits = wc->dlid_path_bits; ah_attr.sl = wc->sl; ah_attr.static_rate = 0; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; /* Should sgid be looked up ? 
*/ @@ -162,14 +164,14 @@ } send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } - send_wr.wr.ud.mad_hdr = (struct ib_mad_hdr *)mad; + send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; send_wr.wr_id = ++port_priv->wr_id; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -180,7 +182,8 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -195,7 +198,7 @@ return ret; } -int agent_send(struct ib_mad *mad, +int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, @@ -212,7 +215,7 @@ } /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad_hdr.mgmt_class) { + switch (mad->mad.mad.mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: mad_agent = port_priv->dr_smp_agent; break; @@ -269,13 +272,14 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); /* Release allocated memory */ - kfree(agent_send_wr->mad); + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); kfree(agent_send_wr); } Index: mad.c =================================================================== --- mad.c (revision 1181) +++ mad.c (working copy) @@ -69,7 +69,7 @@ MODULE_AUTHOR("Sean Hefty"); -static kmem_cache_t *ib_mad_cache; +kmem_cache_t *ib_mad_cache; static struct list_head ib_mad_port_list; static u32 ib_mad_client_id = 0; @@ -83,7 +83,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, @@ -1067,12 +1068,17 @@ { struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_private *recv; + struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; struct ib_smp *smp; int solicited; + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; dequeue_mad(mad_list); @@ -1119,11 +1125,9 @@ /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { - struct ib_mad *response; struct ib_grh *grh; int ret; - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); if (!response) { printk(KERN_ERR PFX "No memory for response MAD\n"); /* @@ -1137,32 
+1141,29 @@ port_priv->port_num, wc->slid, recv->header.recv_buf.mad, - response); + &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad_hdr.mgmt_class == + if (response->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, + (struct ib_smp *)&response->mad.mad, port_priv->device->node_type, port_priv->port_num, port_priv->device->phys_port_cnt)) { - kfree(response); goto out; } } /* Send response */ grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); - if (agent_send(response, grh, wc, - port_priv->device, - port_priv->port_num)) { - kfree(response); - } + if (!agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; goto out; } - } else - kfree(response); + } } /* Determine corresponding MAD agent for incoming receive MAD */ @@ -1183,7 +1184,7 @@ kmem_cache_free(ib_mad_cache, recv); /* Post another receive request for this QP */ - ib_mad_post_receive_mad(qp_info); + ib_mad_post_receive_mad(qp_info, response); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1491,7 +1492,8 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; @@ -1499,19 +1501,23 @@ struct ib_recv_wr *bad_recv_wr; int ret; - /* - * Allocate memory for receive buffer. - * This is for both MAD and private header - * which contains the receive tracking structure. - * By prepending this header, there is one rather - * than two memory allocations. - */ - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - printk(KERN_ERR PFX "No memory for receive buffer\n"); - return -ENOMEM; + if (mad) + mad_priv = mad; + else { + /* + * Allocate memory for receive buffer. + * This is for both MAD and private header + * which contains the receive tracking structure. + * By prepending this header, there is one rather + * than two memory allocations. + */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? 
+ GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + return -ENOMEM; + } } /* Setup scatter list */ @@ -1559,7 +1565,7 @@ int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - ret = ib_mad_post_receive_mad(qp_info); + ret = ib_mad_post_receive_mad(qp_info, NULL); if (ret) { printk(KERN_ERR PFX "receive post %d failed " "on %s port %d\n", i + 1, _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From libor at topspin.com Tue Nov 9 09:55:13 2004 From: libor at topspin.com (Libor Michalek) Date: Tue, 9 Nov 2004 09:55:13 -0800 Subject: [openib-general] VAPI_RETRY_EXC_ERR In-Reply-To: <4A388685F814D54CAE412B2DAB7CE91C195455@initexch.topspincom.com>; from sreenivasulu@topspin.com on Tue, Nov 09, 2004 at 04:19:17PM +0530 References: <4A388685F814D54CAE412B2DAB7CE91C195455@initexch.topspincom.com> Message-ID: <20041109095513.A30186@topspin.com> On Tue, Nov 09, 2004 at 04:19:17PM +0530, Sreenivasulu Pulichintala wrote: > -----Original Message----- > From: Sreenivasulu Pulichintala > Sent: Tuesday, November 09, 2004 3:56 PM > To: openib-general at openib.org > Subject: [openib-general] VAPI_RETRY_EXC_ERR > > HI, > > I use MPICH 1.2.5 and MVAPICH 0.9.2 stack and when I run some of my > fortran applications, some times my application crashes producing the > following error - > > Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 Of the possible issues that Tziporet lists, the most likely problem with MVAPICH 0.9.2 is that the local ack timeout is too small for either large or blocking clusters. It is currently set to 10 (DEFAULT_ACK_TIMEOUT) which translates to 4 milliseconds. (IBTA spec section 9.9.2) I would try a value such as 15 or 20... Also the retry counter is set using the define DEFAULT_RETRY_COUNT in the MVAPICH source. It's currently set to 5. -Libor From roland at topspin.com Tue Nov 9 11:37:33 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 11:37:33 -0800 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <1100021450.13933.339.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 12:30:50 -0500") References: <1100021450.13933.339.camel@localhost.localdomain> Message-ID: <52is8e7r0i.fsf@topspin.com> Hal> One more thing on this I forgot to post: As I am not yet set Hal> up with Kegel cross tools (and don't have a machine where the Hal> pci_ macros are non trivial), I would appreciate it if Hal> someone could verify these changes (or latest code) on some Hal> architecture where the pci_ macros are non trivial. It builds fine on all the architectures I test but (with r1184) the SMA doesn't seem to be working (port stays in INIT state). I see the port_rcv_data counter going up so I know the SM is sweeping. On i386 I don't see anything in the log, and on ppc64 I see a stream of: Invalid directed route in the kernel log. - R. 
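A note on the arithmetic in Libor's ack-timeout suggestion above: the local ack timeout programmed into the QP is a 5-bit exponent, and per IBTA spec section 9.9.2 the resulting wait is 4.096 usec * 2^timeout. The sketch below is illustration only; ack_timeout_usec() is a hypothetical helper written for this note, not code from MVAPICH or the gen2 tree.

/* Illustration only: convert the 5-bit Local ACK Timeout exponent
 * into a wall-clock wait, per IBTA spec 9.9.2. */
#include <stdio.h>

static double ack_timeout_usec(unsigned int timeout_exp)
{
	return 4.096 * (double)(1u << timeout_exp);
}

int main(void)
{
	/* DEFAULT_ACK_TIMEOUT == 10 -> ~4.2 ms (the ~4 ms Libor cites);
	 * 15 -> ~134 ms; 20 -> ~4.3 s. */
	unsigned int exps[] = { 10, 15, 20 };
	unsigned int i;

	for (i = 0; i < 3; i++)
		printf("timeout=%u -> %.1f ms\n", exps[i],
		       ack_timeout_usec(exps[i]) / 1000.0);
	return 0;
}

Each +1 on the exponent doubles the wait, so raising the value from 10 to 15 gives the HCA 32 times longer before a retry is declared, which is why it helps on large or blocking (oversubscribed) fabrics.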
From roland at topspin.com Tue Nov 9 11:39:16 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 11:39:16 -0800 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <52is8e7r0i.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 11:37:33 -0800") References: <1100021450.13933.339.camel@localhost.localdomain> <52is8e7r0i.fsf@topspin.com> Message-ID: <52ekj27qxn.fsf@topspin.com> By the way, we probably want this applied: Index: core/mad.c =================================================================== --- core/mad.c (revision 1184) +++ core/mad.c (working copy) @@ -385,7 +385,7 @@ mad_agent->device->node_type, mad_agent->port_num)) { ret = -EINVAL; - printk(KERN_ERR "Invalid directed route\n"); + printk(KERN_ERR PFX "Invalid directed route\n"); goto error1; } if (smi_check_local_dr_smp(smp, From mshefty at ichips.intel.com Tue Nov 9 11:40:14 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 11:40:14 -0800 Subject: [openib-general] error trying to bring up node Message-ID: <41911D1E.10608@ichips.intel.com> I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... - Sean From mshefty at ichips.intel.com Tue Nov 9 11:56:47 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 11:56:47 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <41911D1E.10608@ichips.intel.com> References: <41911D1E.10608@ichips.intel.com> Message-ID: <419120FF.7010607@ichips.intel.com> Sean Hefty wrote: > I have two nodes directly connected. When trying to bring up the openib > node, I receive a local length error on the CQ after trying to perform a > send. > > I'm continuing to debug... static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent_port_private *port_priv, struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc) { ... /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, &mad->grh, sizeof *mad - sizeof mad->header, PCI_DMA_TODEVICE); gather_list.length = sizeof *mad - sizeof mad->header; gather_list.lkey = (*port_priv->mr).lkey; Wouldn't this result in sending the GRH data buffer before the MAD buffer? Does mthca check the size of sends that are posted to QP0/1 and report an error if they are larger than 256 bytes? - Sean From halr at voltaire.com Tue Nov 9 12:19:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:19:51 -0500 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <52ekj27qxn.fsf@topspin.com> References: <1100021450.13933.339.camel@localhost.localdomain> <52is8e7r0i.fsf@topspin.com> <52ekj27qxn.fsf@topspin.com> Message-ID: <1100031591.13933.349.camel@localhost.localdomain> On Tue, 2004-11-09 at 14:39, Roland Dreier wrote: > By the way, we probably want this applied: Thanks. Applied. -- Hal From roland at topspin.com Tue Nov 9 12:25:55 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:25:55 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <419120FF.7010607@ichips.intel.com> (Sean Hefty's message of "Tue, 09 Nov 2004 11:56:47 -0800") References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> Message-ID: <52zn1q6a7g.fsf@topspin.com> Sean> Wouldn't this result in sending the GRH data buffer before Sean> the MAD buffer? 
Yes, it sure looks that way. Sean> Does mthca check the size of sends that are Sean> posted to QP0/1 and report an error if they are larger than Sean> 256 bytes? No, it will probably send it. (And cause a problem on the receive side) - R. From halr at voltaire.com Tue Nov 9 12:24:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:24:16 -0500 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <52is8e7r0i.fsf@topspin.com> References: <1100021450.13933.339.camel@localhost.localdomain> <52is8e7r0i.fsf@topspin.com> Message-ID: <1100031856.13933.353.camel@localhost.localdomain> On Tue, 2004-11-09 at 14:37, Roland Dreier wrote: > Hal> One more thing on this I forgot to post: As I am not yet set > Hal> up with Kegel cross tools (and don't have a machine where the > Hal> pci_ macros are non trivial), I would appreciate it if > Hal> someone could verify these changes (or latest code) on some > Hal> architecture where the pci_ macros are non trivial. > > It builds fine on all the architectures I test but (with r1184) the > SMA doesn't seem to be working (port stays in INIT state). I see the > port_rcv_data counter going up so I know the SM is sweeping. On i386 > I don't see anything in the log, and on ppc64 I see a stream of: > > Invalid directed route > > in the kernel log. In smi.c, smi_handle_dr_smp_send is indicating this packet is invalid for some reason. What are the hop_cnt and hop_ptr in the outgoing SMP ? Is your configuration the same as Sean's (back to back HCAs) ? Thanks. -- Hal From halr at voltaire.com Tue Nov 9 12:25:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:25:37 -0500 Subject: [openib-general] error trying to bring up node In-Reply-To: <419120FF.7010607@ichips.intel.com> References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> Message-ID: <1100031937.13933.355.camel@localhost.localdomain> On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: > Sean Hefty wrote: > > > I have two nodes directly connected. When trying to bring up the openib > > node, I receive a local length error on the CQ after trying to perform a > > send. > > > > I'm continuing to debug... > > static int agent_mad_send(struct ib_mad_agent *mad_agent, > struct ib_agent_port_private *port_priv, > struct ib_mad_private *mad, > struct ib_grh *grh, > struct ib_wc *wc) > { > ... > /* PCI mapping */ > gather_list.addr = pci_map_single(mad_agent->device->dma_device, > &mad->grh, > sizeof *mad - > sizeof mad->header, > PCI_DMA_TODEVICE); > gather_list.length = sizeof *mad - sizeof mad->header; > gather_list.lkey = (*port_priv->mr).lkey; > > > Wouldn't this result in sending the GRH data buffer before the MAD > buffer? Does mthca check the size of sends that are posted to QP0/1 and > report an error if they are larger than 256 bytes? Doesn't that just map starting at the GRH ? This is to handle PMA responses which might have GRHs. -- Hal From roland at topspin.com Tue Nov 9 12:30:48 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:30:48 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <1100031937.13933.355.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 15:25:37 -0500") References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> <1100031937.13933.355.camel@localhost.localdomain> Message-ID: <52u0ry69zb.fsf@topspin.com> Hal> Doesn't that just map starting at the GRH ? 
This is to handle Hal> PMA responses which might have GRHs. Sure, it maps starting at the GRH and uses that as the start of the gather segment used for the send (and tries to send more than 256 bytes). This is wrong even when sending a packet with GRH (the address vector has the global route information; you don't have to supply a GRH when posting the send). - Roland From mshefty at ichips.intel.com Tue Nov 9 12:32:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 12:32:55 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <1100031937.13933.355.camel@localhost.localdomain> References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> <1100031937.13933.355.camel@localhost.localdomain> Message-ID: <41912977.50806@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: > >>Sean Hefty wrote: >> >> >>>I have two nodes directly connected. When trying to bring up the openib >>>node, I receive a local length error on the CQ after trying to perform a >>>send. >>> >>>I'm continuing to debug... >> >>static int agent_mad_send(struct ib_mad_agent *mad_agent, >> struct ib_agent_port_private *port_priv, >> struct ib_mad_private *mad, >> struct ib_grh *grh, >> struct ib_wc *wc) >>{ >>... >> /* PCI mapping */ >> gather_list.addr = pci_map_single(mad_agent->device->dma_device, >> &mad->grh, >> sizeof *mad - >> sizeof mad->header, >> PCI_DMA_TODEVICE); >> gather_list.length = sizeof *mad - sizeof mad->header; >> gather_list.lkey = (*port_priv->mr).lkey; >> >> >>Wouldn't this result in sending the GRH data buffer before the MAD >>buffer? Does mthca check the size of sends that are posted to QP0/1 and >>report an error if they are larger than 256 bytes? > > > Doesn't that just map starting at the GRH ? This is to handle PMA > responses which might have GRHs. It does. But the GRH buffer shouldn't be sent by the user. My thought was the this would result in the receiver mis-interpreting the received MAD, and probably dropping it. But I'm seeing that the work request completes in error, which makes me think that there's still another error somewhere. - Sean From halr at voltaire.com Tue Nov 9 12:30:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:30:29 -0500 Subject: [openib-general] error trying to bring up node In-Reply-To: <1100031937.13933.355.camel@localhost.localdomain> References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> <1100031937.13933.355.camel@localhost.localdomain> Message-ID: <1100032229.13933.360.camel@localhost.localdomain> On Tue, 2004-11-09 at 15:25, Hal Rosenstock wrote: > Doesn't that just map starting at the GRH ? This is to handle PMA > responses which might have GRHs. Never mind. I see the problem. 
-- Hal From halr at voltaire.com Tue Nov 9 12:49:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:49:34 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length Message-ID: <1100033372.17687.3.camel@hpc-1> agent: Fix agent_mad_send PCI mapping and gather address and length Index: agent.c =================================================================== --- agent.c (revision 1183) +++ agent.c (working copy) @@ -116,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, - &mad->grh, - sizeof *mad - sizeof mad->header, + &mad->mad, + sizeof(struct ib_mad), PCI_DMA_TODEVICE); - gather_list.length = sizeof *mad - sizeof mad->header; + gather_list.length = sizeof(struct ib_mad); gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -272,8 +272,7 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), + sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); From roland at topspin.com Tue Nov 9 12:50:30 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:50:30 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100033372.17687.3.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 15:49:34 -0500") References: <1100033372.17687.3.camel@hpc-1> Message-ID: <52bre6692h.fsf@topspin.com> OK, this works on my i386 system but I'm still getting ib_mad: Invalid directed route on ppc64. I'll try to debug what exactly is happening (ie put some prints in to see why smi_handle_dr_smp_send() is rejecting it). - R. From roland at topspin.com Tue Nov 9 12:53:13 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:53:13 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52bre6692h.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 12:50:30 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> Message-ID: <527jou68xy.fsf@topspin.com> Roland> OK, this works on my i386 system but I'm still getting Roland> ib_mad: Invalid directed route Roland> on ppc64. I'll try to debug what exactly is happening (ie Roland> put some prints in to see why smi_handle_dr_smp_send() is Roland> rejecting it). By the way, the i386 system is connected directly to the switch running the SM, while the ppc64 system is a few hops away. So it's just as likely to be a DR SMI handling problem as a ppc64 architecture issue. - R. From halr at voltaire.com Tue Nov 9 12:55:43 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:55:43 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <527jou68xy.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> Message-ID: <1100033742.2170.11.camel@localhost.localdomain> On Tue, 2004-11-09 at 15:53, Roland Dreier wrote: > By the way, the i386 system is connected directly to the switch > running the SM, That's the config I run in too. > while the ppc64 system is a few hops away. I think Sean's original config was a couple of hops. > So it's > just as likely to be a DR SMI handling problem as a ppc64 architecture > issue. 
My money's on a DR SMI issue :-) -- Hal From root at DYN318430BLD.linux.local Tue Nov 9 13:32:15 2004 From: root at DYN318430BLD.linux.local (root) Date: Tue, 9 Nov 2004 13:32:15 -0800 (PST) Subject: [openib-general] [PATCH] Unnecessary initialization of sa_query in failure case. In-Reply-To: <52sm7kb0zk.fsf@topspin.com> Message-ID: diff -ruNp org/core/sa_query.c new/core/sa_query.c --- org/core/sa_query.c 2004-11-09 12:51:35.000000000 -0800 +++ new/core/sa_query.c 2004-11-09 13:30:38.000000000 -0800 @@ -547,7 +547,6 @@ int ib_sa_path_rec_get(struct ib_device *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { - *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); } @@ -623,7 +622,6 @@ int ib_sa_mcmember_rec_query(struct ib_d *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { - *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); } From roland at topspin.com Tue Nov 9 15:03:00 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 15:03:00 -0800 Subject: [openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case. In-Reply-To: (root@dyn318430bld.linux.local's message of "Tue, 9 Nov 2004 13:32:15 -0800 (PST)") References: Message-ID: <52pt2m4od7.fsf@topspin.com> Why is this initialization unnecessary? If we delete these lines, isn't sa_query left pointing to invalid memory when a send fails? - R. From root at DYN318430BLD.linux.local Tue Nov 9 14:06:47 2004 From: root at DYN318430BLD.linux.local (root) Date: Tue, 9 Nov 2004 14:06:47 -0800 (PST) Subject: [openib-general] Question on handle_outgoing_smp Message-ID: In the following code: if (smi_check_local_dr_smp(smp, mad_agent->device, mad_agent->port_num)) { ... ret = mad_agent->device->process_mad( mad_agent->device, 0, mad_agent->port_num, smp->dr_slid, /* ? */ (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); How do we guarantee that process_mad() was supplied (not NULL)? That is, what if smi_check_local_smp didn't get called via smi_check_local_dr_smp? thx, - KK From krkumar at us.ibm.com Tue Nov 9 15:31:44 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 9 Nov 2004 15:31:44 -0800 (PST) Subject: [openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case. In-Reply-To: <52pt2m4od7.fsf@topspin.com> Message-ID: On Tue, 9 Nov 2004, Roland Dreier wrote: > Why is this initialization unnecessary? If we delete these lines, isn't > sa_query left pointing to invalid memory when a send fails? Because ULPs should not use a pointer that is set in the callee routine if the call failed. In this case, path_rec_start and unicast_arp_start should not use "query" if the call failed. And "query" is a stack variable in those routines, so it won't hang around too long :-) thanks, - KK From tduffy at sun.com Tue Nov 9 15:49:35 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 09 Nov 2004 15:49:35 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <20041109232308.607B72283D4@openib.ca.sandia.gov> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> Message-ID: <1100044175.12438.3.camel@duffman> On Tue, 2004-11-09 at 15:23 -0800, halr at openib.org wrote: > Author: halr > Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) > New Revision: 1186 > > Modified: > gen2/trunk/src/linux-kernel/infiniband/core/agent.c > Log: > Fix agent_mad_send PCI mapping and gather address and length Please revert this change.
It seems to break x86_64 as well, at least in my setup. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Nov 9 15:54:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 15:54:07 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100033742.2170.11.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 15:55:43 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> Message-ID: <52llda4m00.fsf@topspin.com> OK, I think I understand the problem, but I'm not sure what the correct solution is. When a DR SMP arrives at a CA from the SM, hop_cnt == hop_ptr == number of hops in the directed route, and somehow they are not updated correctly by the time the response reaches handle_outgoing_smp(). I can't follow the code well enough to understand why all DR SMPs have to go through both smi_handle_dr_smp_recv() and smi_handle_dr_smp_send() but the patch below seems to correct things for me (ports go to ACTIVE on all my systems). (handle_outgoing_smp() already calls smi_handle_dr_smp_recv() so it seems the response was getting passed to smi_handle_dr_smp_recv() twice). - R. Index: mad.c =================================================================== --- mad.c (revision 1186) +++ mad.c (working copy) @@ -1144,16 +1144,6 @@ &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad.mad.mad_hdr.mgmt_class == - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&response->mad.mad, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) { - goto out; - } - } /* Send response */ grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); From Nitin.Hande at Sun.COM Tue Nov 9 15:55:45 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Tue, 09 Nov 2004 15:55:45 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <1100044175.12438.3.camel@duffman> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> Message-ID: <41915901.9000502@Sun.COM> Tom Duffy wrote: > On Tue, 2004-11-09 at 15:23 -0800, halr at openib.org wrote: > >>Author: halr >>Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) >>New Revision: 1186 >> >>Modified: >> gen2/trunk/src/linux-kernel/infiniband/core/agent.c >>Log: >>Fix agent_mad_send PCI mapping and gather address and length > > > Please revert this change. It seems to break x86_64 as well, at least > in my setup. certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? 
Thanks Nitin > > -tduffy > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Tue Nov 9 16:01:15 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 16:01:15 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <41915901.9000502@Sun.COM> (Nitin Hande's message of "Tue, 09 Nov 2004 15:55:45 -0800") References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> Message-ID: <528y9a4lo4.fsf@topspin.com> Nitin> certainly it does break my x86_64 setup too. Can we revert Nitin> back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be able to use the patch I just posted to get things working again. Let's give Hal a chance to fix things up properly. - R. From mshefty at ichips.intel.com Tue Nov 9 16:08:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 16:08:46 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <528y9a4lo4.fsf@topspin.com> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> <528y9a4lo4.fsf@topspin.com> Message-ID: <41915C0E.9040807@ichips.intel.com> Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just posted to get things working again. Let's give Hal a chance to > fix things up properly. This patch just fixed the issues I was having as well, and I'm running with two systems directly connected. Thanks. - Sean From tduffy at sun.com Tue Nov 9 16:12:04 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 09 Nov 2004 16:12:04 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <528y9a4lo4.fsf@topspin.com> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> <528y9a4lo4.fsf@topspin.com> Message-ID: <1100045524.12438.14.camel@duffman> On Tue, 2004-11-09 at 16:01 -0800, Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just posted to get things working again. Let's give Hal a chance to > fix things up properly. OK, your patch got rid of the "ib_mad: Invalid directed route" message anyways. And my port is going to ACTIVE now. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Nitin.Hande at Sun.COM Tue Nov 9 16:11:52 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Tue, 09 Nov 2004 16:11:52 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <528y9a4lo4.fsf@topspin.com> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> <528y9a4lo4.fsf@topspin.com> Message-ID: <41915CC8.9070000@Sun.COM> Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just posted to get things working again. Let's give Hal a chance to > fix things up properly. > > - R. Applying your patch, I do not see the "redirect message" anymore. But I cannot ping the peer interface yet. I am on the x86_64 arch, btw. Unfortunately I gotta run, will debug more later tonight. Thanks Nitin From mshefty at ichips.intel.com Tue Nov 9 17:12:53 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 17:12:53 -0800 Subject: [openib-general] [PATCH] handle QP0/1 send queue overrun Message-ID: <41916B15.8050909@ichips.intel.com> The following patch adds support for handling QP0/1 send queue overrun, along with a couple of related fixes: * The patch includes the one provided by Roland in order to configure the fabric. * The code no longer modifies the user's send_wr structures when sending a MAD. * Work requests for sent MADs are copied in order to handle both queuing and error recovery (when added). * The receive side code was slightly restructured to use a single function to repost receives. If a receive cannot be posted for some reason (e.g. lack of memory), it will now try to refill the receive queue when posting an additional receive. (This will also make it possible for the code to be lazier about reposting receives, which would allow for better batching of completions.) Also, I switched my mailer, so I apologize in advance if I hose up my patch.
- Sean Index: core/agent.c =================================================================== --- core/agent.c (revision 1186) +++ core/agent.c (working copy) @@ -117,9 +117,9 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, &mad->mad, - sizeof(struct ib_mad), + sizeof mad->mad, PCI_DMA_TODEVICE); - gather_list.length = sizeof(struct ib_mad); + gather_list.length = sizeof mad->mad; gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -182,8 +182,7 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), + sizeof mad->mad, PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -272,7 +271,7 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof agent_send_wr->mad->mad, PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); Index: core/mad.c =================================================================== --- core/mad.c (revision 1186) +++ core/mad.c (working copy) @@ -83,9 +83,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, - struct ib_mad_private *mad); -static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); @@ -345,24 +344,11 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); -static void queue_mad(struct ib_mad_queue *mad_queue, - struct ib_mad_list_head *mad_list) -{ - unsigned long flags; - - mad_list->mad_queue = mad_queue; - spin_lock_irqsave(&mad_queue->lock, flags); - list_add_tail(&mad_list->list, &mad_queue->list); - mad_queue->count++; - spin_unlock_irqrestore(&mad_queue->lock, flags); -} - static void dequeue_mad(struct ib_mad_list_head *mad_list) { struct ib_mad_queue *mad_queue; unsigned long flags; - BUG_ON(!mad_list->mad_queue); mad_queue = mad_list->mad_queue; spin_lock_irqsave(&mad_queue->lock, flags); list_del(&mad_list->list); @@ -481,24 +467,35 @@ } static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_send_wr_private *mad_send_wr, - struct ib_send_wr *send_wr, - struct ib_send_wr **bad_send_wr) + struct ib_mad_send_wr_private *mad_send_wr) { struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; int ret; /* Replace user's WR ID with our own to find WR upon completion */ qp_info = mad_agent_priv->qp_info; - mad_send_wr->wr_id = send_wr->wr_id; - send_wr->wr_id = (unsigned long)&mad_send_wr->mad_list; - queue_mad(&qp_info->send_queue, &mad_send_wr->mad_list); + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; - ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); - if (ret) { - printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - dequeue_mad(&mad_send_wr->mad_list); - *bad_send_wr = send_wr; + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if 
(qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; } return ret; } @@ -511,9 +508,8 @@ struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { - int ret; - struct ib_send_wr *cur_send_wr, *next_send_wr; - struct ib_mad_agent_private *mad_agent_priv; + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; /* Validate supplied parameters */ if (!bad_send_wr) @@ -522,6 +518,9 @@ if (!mad_agent || !send_wr ) goto error2; + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + if (!mad_agent->send_handler || (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) goto error2; @@ -531,30 +530,31 @@ agent); /* Walk list of send WRs and post each on send list */ - cur_send_wr = send_wr; - while (cur_send_wr) { + while (send_wr) { unsigned long flags; + struct ib_send_wr *next_send_wr; struct ib_mad_send_wr_private *mad_send_wr; struct ib_smp *smp; - if (!cur_send_wr->wr.ud.mad_hdr) { - *bad_send_wr = cur_send_wr; + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion. + */ + if (!send_wr->wr.ud.mad_hdr) { printk(KERN_ERR PFX "MAD header must be supplied " - "in WR %p\n", cur_send_wr); - goto error1; + "in WR %p\n", send_wr); + goto error2; } + next_send_wr = (struct ib_send_wr *)send_wr->next; - next_send_wr = (struct ib_send_wr *)cur_send_wr->next; - - smp = (struct ib_smp *)cur_send_wr->wr.ud.mad_hdr; + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = handle_outgoing_smp(mad_agent, smp, cur_send_wr); - if (ret < 0) { /* error */ - *bad_send_wr = cur_send_wr; - goto error1; - } else if (ret == 1) { /* locally consumed */ + ret = handle_outgoing_smp(mad_agent, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ goto next; - } } /* Allocate MAD send WR tracking structure */ @@ -562,16 +562,21 @@ (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : GFP_KERNEL); if (!mad_send_wr) { - *bad_send_wr = cur_send_wr; printk(KERN_ERR PFX "No memory for " "ib_mad_send_wr_private\n"); - return -ENOMEM; + ret = -ENOMEM; + goto error2; } + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; mad_send_wr->agent = mad_agent; /* Timeout will be updated after send completes */ - mad_send_wr->timeout = msecs_to_jiffies(cur_send_wr->wr. + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
ud.timeout_ms); /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); @@ -584,31 +589,24 @@ &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - cur_send_wr->next = NULL; - ret = ib_send_mad(mad_agent_priv, mad_send_wr, - cur_send_wr, bad_send_wr); + ret = ib_send_mad(mad_agent_priv, mad_send_wr); if (ret) { - /* Handle QP overrun separately... -ENOMEM */ - /* Handle posting when QP is in error state... */ - /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); list_del(&mad_send_wr->agent_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - atomic_dec(&mad_agent_priv->refcount); - return ret; + goto error2; } next: - cur_send_wr = next_send_wr; + send_wr = next_send_wr; } - return 0; error2: *bad_send_wr = send_wr; error1: - return -EINVAL; + return ret; } EXPORT_SYMBOL(ib_post_send_mad); @@ -1125,7 +1123,6 @@ /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { - struct ib_grh *grh; int ret; if (!response) { @@ -1144,20 +1141,8 @@ &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad.mad.mad_hdr.mgmt_class == - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&response->mad.mad, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) { - goto out; - } - } /* Send response */ - grh = (void *)recv->header.recv_buf.mad - - sizeof(struct ib_grh); - if (!agent_send(response, grh, wc, + if (!agent_send(response, &recv->grh, wc, port_priv->device, port_priv->port_num)) response = NULL; @@ -1178,13 +1163,14 @@ */ recv = NULL; } - out: - if (recv) - kmem_cache_free(ib_mad_cache, recv); - /* Post another receive request for this QP */ - ib_mad_post_receive_mad(qp_info, response); + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1291,16 +1277,51 @@ static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { - struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, mad_list); - dequeue_mad(mad_list); - /* Restore client wr_id in WC */ + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue. */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send. 
*/ wc->wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } } /* @@ -1492,88 +1513,74 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, - struct ib_mad_private *mad) +/* + * Allocate receive MADs and post receive WRs for them. + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) { + unsigned long flags; + int post, ret; struct ib_mad_private *mad_priv; struct ib_sge sg_list; - struct ib_recv_wr recv_wr; - struct ib_recv_wr *bad_recv_wr; - int ret; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; - if (mad) - mad_priv = mad; - else { - /* - * Allocate memory for receive buffer. - * This is for both MAD and private header - * which contains the receive tracking structure. - * By prepending this header, there is one rather - * than two memory allocations. - */ - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - printk(KERN_ERR PFX "No memory for receive buffer\n"); - return -ENOMEM; - } - } - - /* Setup scatter list */ - sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, - &mad_priv->grh, - sizeof *mad_priv - - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); + /* Initialize common scatter list fields. */ sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; sg_list.lkey = (*qp_info->port_priv->mr).lkey; - /* Setup receive WR */ + /* Initialize common receive WR fields. */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; - pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); - - /* Post receive WR. */ - queue_mad(&qp_info->recv_queue, &mad_priv->header.mad_list); - ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); - if (ret) { - dequeue_mad(&mad_priv->header.mad_list); - pci_unmap_single(qp_info->port_priv->device->dma_device, - pci_unmap_addr(&mad_priv->header, mapping), - sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); - - kmem_cache_free(ib_mad_cache, mad_priv); - printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx " - "failed ret = %d\n", - (unsigned long long) recv_wr.wr_id, ret); - return -EINVAL; - } - - return 0; -} -/* - * Allocate receive MADs and post receive WRs for them - */ -static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) -{ - int i, ret; - - for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - ret = ib_mad_post_receive_mad(qp_info, NULL); + do { + /* Allocate and map receive buffer. 
*/ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = pci_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + PCI_DMA_FROMDEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR. */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - printk(KERN_ERR PFX "receive post %d failed " - "on %s port %d\n", i + 1, - qp_info->port_priv->device->name, - qp_info->port_priv->port_num); + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + pci_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + PCI_DMA_FROMDEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: = %d\n", ret); break; } - } + } while (post); return ret; } @@ -1625,6 +1632,7 @@ spin_lock_irqsave(&qp_info->send_queue.lock, flags); INIT_LIST_HEAD(&qp_info->send_queue.list); qp_info->send_queue.count = 0; + INIT_LIST_HEAD(&qp_info->overflow_list); spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); } @@ -1789,7 +1797,7 @@ } for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { printk(KERN_ERR PFX "Couldn't post receive " "requests\n"); @@ -1851,6 +1859,7 @@ qp_info->port_priv = port_priv; init_mad_queue(qp_info, &qp_info->send_queue); init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); memset(&qp_init_attr, 0, sizeof qp_init_attr); qp_init_attr.send_cq = port_priv->cq; @@ -1870,6 +1879,9 @@ ret = PTR_ERR(qp_info->qp); goto error; } + /* Use minimum queue sizes unless the CQ is resized. */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; return 0; error: Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1186) +++ core/mad_priv.h (working copy) @@ -122,6 +122,8 @@ struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; u64 wr_id; /* client WR ID */ u64 tid; unsigned long timeout; @@ -141,6 +143,7 @@ spinlock_t lock; struct list_head list; int count; + int max_active; struct ib_mad_qp_info *qp_info; }; @@ -149,7 +152,7 @@ struct ib_qp *qp; struct ib_mad_queue send_queue; struct ib_mad_queue recv_queue; - /* struct ib_mad_queue overflow_queue; */ + struct list_head overflow_list; }; struct ib_mad_port_private { -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: diffs URL: From halr at voltaire.com Tue Nov 9 19:26:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 22:26:07 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52llda4m00.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> Message-ID: <1100057166.17621.23.camel@hpc-1> On Tue, 2004-11-09 at 18:54, Roland Dreier wrote: > OK, I think I understand the problem, but I'm not sure what the > correct solution is. When a DR SMP arrives at a CA from the SM, > hop_cnt == hop_ptr == number of hops in the directed route, What was the number ? > and somehow they are not updated correctly by the time the response > reaches handle_outgoing_smp(). > > I can't follow the code well enough to understand why all DR SMPs have > to go through both smi_handle_dr_smp_recv() and > smi_handle_dr_smp_send() but the patch below seems to correct things > for me (ports go to ACTIVE on all my systems). (handle_outgoing_smp() > already calls smi_handle_dr_smp_recv() so it seems the response was > getting passed to smi_handle_dr_smp_recv() twice). I integrated this patch and checked it back in. I don't think this is the solution for all cases (and something else is broken). The second call to smi_handle_dr_smp_recv was to validate the DR in the response packet before sending it. The response would be a returning DR packet (D bit 1). If hop_cnt == hop_ptr, I suspect this has been broken since r1163 (not including the other things I broke in it today). I will do some more work in understanding this tomorrow. -- Hal From roland at topspin.com Tue Nov 9 20:55:46 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 20:55:46 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100057166.17621.23.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 22:26:07 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> Message-ID: <52r7n22tgt.fsf@topspin.com> Roland> OK, I think I understand the problem, but I'm not sure Roland> what the correct solution is. When a DR SMP arrives at a Roland> CA from the SM, hop_cnt == hop_ptr == number of hops in Roland> the directed route, Hal> What was the number ? For one port it was 4 and for another it was 6. It could really be anything (it's just how many hops away the SM is). Hal> I integrated this patch and checked it back in. I don't think Hal> this is the solution for all cases (and something else is Hal> broken). Could be. I had a hard time checking the code in smi.c (which is split between smi_handle_dr_smp_recv() and smi_handle_dr_smp_send() as well as smi_check_forward_dr_smp(), but which has outgoing and returning DR handling mixed together) against the IB spec (which splits outgoing and returning DR handling). Hal> The second call to smi_handle_dr_smp_recv was to validate the Hal> DR in the response packet before sending it. The response Hal> would be a returning DR packet (D bit 1). If hop_cnt == Hal> hop_ptr, I guess the problem with calling smi_handle_dr_smp_recv() twice on the same packet is that the function may alter the packet. - R. 
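To make the point about altering the packet concrete, here is a deliberately oversimplified sketch -- not the real smi.c, and ignoring the spec's full C14-* rule set -- of the shape of the directed-route receive check. The crucial property is that it is not a pure predicate: it advances routing state stored in the SMP itself.

/* Illustration only -- a stand-in for the hop-pointer handling in
 * smi_handle_dr_smp_recv(). For a returning SMP (D bit = 1) the hop
 * pointer walks back toward the requester as the packet is processed. */
struct dr_state_sketch {
	unsigned char hop_ptr;
	unsigned char hop_cnt;
};

static int sketch_dr_recv_step(struct dr_state_sketch *smp)
{
	if (smp->hop_ptr == 0)
		return 0;	/* no hops left to consume: invalid route */
	smp->hop_ptr--;		/* side effect: one hop of the route used up */
	return 1;
}

A validate-then-send flow that runs such a step once on receive and again before transmitting the reply consumes two hops for one actual hop; hop_ptr then no longer agrees with hop_cnt and the SMP is rejected as an invalid directed route, matching the symptom seen on the multi-hop ppc64 setup.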
From roland at topspin.com Tue Nov 9 21:55:43 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 21:55:43 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52r7n22tgt.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 20:55:46 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> Message-ID: <52ekj22qow.fsf@topspin.com> It seems that MAD handling is still not quite right. It seems in my setup that IPoIB is not seeing the response to its MCMember set... (it does look like the query is reaching the SM) - R. From halr at voltaire.com Wed Nov 10 06:28:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 09:28:11 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52ekj22qow.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> Message-ID: <1100096891.801.25.camel@hpc-1> On Wed, 2004-11-10 at 00:55, Roland Dreier wrote: > It seems that MAD handling is still not quite right. It seems in my > setup that IPoIB is not seeing the response to its MCMember > set... (it does look like the query is reaching the SM) This is a separate issue from the ports not becoming active (the DR handling issue). I broke this part yesterday (not a good day at all :-( in r1181 and/or r1184, when I added what I thought was a correct change based on Sean's emails: not dispatching additional error cases in ib_mad_recv_done_handler. I then wrongly believed I had verified that things were still working. I can see now that this is wrong and have a fix for what stops IPoIB from working. The problem was that the response was received by the MAD layer but not dispatched, due to the change(s) noted above. So I am patching at least enough to get things operational for now. Please confirm that it works for you. I will not touch things until I hear that it does. Also, it seems to me that no response needs to be handed to process_mad. Does this optimization make sense ? Sorry for the temporary inconvenience. I will try not to do this again. It is no fun for anyone.
-- Hal mad: In ib_mad_recv_done_handler, if process_mad returns SUCCESS but not REPLY, received packet still needs to be dispatched Index: mad.c =================================================================== --- mad.c (revision 1187) +++ mad.c (working copy) @@ -1151,8 +1151,8 @@ port_priv->device, port_priv->port_num)) response = NULL; + goto out; } - goto out; } } From roland at topspin.com Wed Nov 10 07:36:29 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 07:36:29 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100096891.801.25.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 09:28:11 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> Message-ID: <52actp3ede.fsf@topspin.com> >>>>> "Hal" == Hal Rosenstock writes: Hal> I can see now that this is wrong and have a fix for what Hal> stops IPoIB from working. The problem was that the response Hal> was received by the MAD layer but not dispatched due to the Hal> change(s) noted above. Hal> So I am patching at least enough to get things operational Hal> for now. Please confirm that it works for you. I will not Hal> touch things until I hear that it does. Yes, IPoIB works for me again. Hal> Also, it seems to me that no response needs to be handed to Hal> process_mad. Does this optimization make sense ? I'm not sure I understand the question. process_mad definitely needs a buffer to return a response in. Are you suggesting that process_mad overwrite the input buffer when it generates a response? That's probably OK although I'm not sure if it's much of an improvement (process_mad will probably have to allocate a response buffer internally and copy the response when returning). - R. From halr at voltaire.com Wed Nov 10 07:53:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 10:53:21 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52actp3ede.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> Message-ID: <1100102000.2836.6.camel@hpc-1> On Wed, 2004-11-10 at 10:36, Roland Dreier wrote: > Yes, IPoIB works for me again. Thanks for validating. > Hal> Also, it seems to me that no response needs to be handed to > Hal> process_mad. Does this optimization make sense ? > > I'm not sure I understand the question. process_mad definitely needs > a buffer to return a response in. Are you suggesting that process_mad > overwrite the input buffer when it generates a response? That's > probably OK although I'm not sure if it's much of an improvement > (process_mad will probably have to allocate a response buffer > internally and copy the response when returning). I'm asking about also checking the method prior to calling process_mad. If the method is a response method (e.g. GetResp for one), we could bypass calling process_mad. 
Or is this not worth the extra checks in the MAD layer as it is low enough overhead and adds additional protocol knowledge into the MAD layer ? -- Hal From roland at topspin.com Wed Nov 10 08:05:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 08:05:14 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100102000.2836.6.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 10:53:21 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> <1100102000.2836.6.camel@hpc-1> Message-ID: <521xf13d1h.fsf@topspin.com> Hal> I'm asking about also checking the method prior to calling Hal> process_mad. If the method is a response method (e.g. GetResp Hal> for one), we could bypass calling process_mad. Or is this not Hal> worth the extra checks in the MAD layer as it is low enough Hal> overhead and adds additional protocol knowledge into the MAD Hal> layer ? Oh, I see now. I don't think that's worth doing. I think keeping the MAD code simpler is probably best right now. - R. From Nitin.Hande at Sun.COM Wed Nov 10 08:05:48 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Wed, 10 Nov 2004 08:05:48 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100096891.801.25.camel@hpc-1> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> Message-ID: <41923C5C.7080501@Sun.COM> Hal Rosenstock wrote: > On Wed, 2004-11-10 at 00:55, Roland Dreier wrote: > >>It seems that MAD handling is still not quite right. It seems in my >>set up that IPoIB is not seeing the response to its MCMember >>set... (it does look like the query is reaching the SM) > > > This is a separate issue from the ports not becoming active (DR handling > issue). I broke this part yesterday (not a good day at all :-( at either > r1184 and/or r1181 when I added what I thought was correct based on > Sean's emails (not dispatching additional error cases in > ib_mad_recv_done_handler (and then improperly thought I verified the > changes that things were still working)). > > I can see now that this is wrong and have a fix for what stops IPoIB > from working. The problem was that the response was received by the MAD > layer but not dispatched due to the change(s) noted above. > > So I am patching at least enough to get things operational for now. > Please confirm that it works for you. I will not touch things until I > hear that it does. IPoIB seems to be working for me. I am on x86_64 platform. Thanks Nitin > > Also, it seems to me that no response needs to be handed to process_mad. > Does this optimization make sense ? > > Sorry for the temporary inconvenience. I will try not to do this again. > It is no fun for anyone. 
> > -- Hal > > mad: In ib_mad_recv_done_handler, if process_mad returns SUCCESS but not > REPLY, received packet still needs to be dispatched > > Index: mad.c > =================================================================== > --- mad.c (revision 1187) > +++ mad.c (working copy) > @@ -1151,8 +1151,8 @@ > port_priv->device, > port_priv->port_num)) > response = NULL; > + goto out; > } > - goto out; > } > } From halr at voltaire.com Wed Nov 10 08:28:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 11:28:24 -0500 Subject: [openib-general] Question on handle_outgoing_smp References: Message-ID: <001d01c4c742$4e4f5f10$6814a8c0@Gripen> root wrote: > In the following code : > > if (smi_check_local_dr_smp(smp, mad_agent->device, > mad_agent->port_num)) { ... > ret = mad_agent->device->process_mad( > mad_agent->device, > 0, > mad_agent->port_num, > smp->dr_slid, /* ? */ > (struct ib_mad *)smp, > (struct ib_mad > *)&mad_priv->mad); > > How do we guarantee that process_mad() was supplied (not NULL) ? > That is, what if smi_check_local_smp didn't get called via > smi_check_local_dr_smp ? Sorry for the use of a bad mail client here, but I didn't receive this on my normal email client. A check that the process_mad routine is supplied needs to be added here. I missed it in this spot (but had it in the other place in mad.c where process_mad is called). I will issue a patch for this in a while. Thanks. -- Hal From mshefty at ichips.intel.com Wed Nov 10 08:55:34 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 08:55:34 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100096891.801.25.camel@hpc-1> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> Message-ID: <41924806.8060509@ichips.intel.com> Hal Rosenstock wrote: > This is a separate issue from the ports not becoming active (a DR handling > issue). I broke this part yesterday (not a good day at all :-() at either > r1184 and/or r1181, when I added what I thought was correct based on > Sean's emails (not dispatching additional error cases in > ib_mad_recv_done_handler), and then improperly thought I had verified that > things were still working. What exactly does it mean then when process_mad returns success? Do any of the return bits from process_mad indicate that the MAD was for the HCA driver?
- Sean From roland at topspin.com Wed Nov 10 08:59:54 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 08:59:54 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <41924806.8060509@ichips.intel.com> (Sean Hefty's message of "Wed, 10 Nov 2004 08:55:34 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> Message-ID: <52wtwt1vxx.fsf@topspin.com> Sean> What exactly does it mean then when process_mad returns Sean> success? Do any of the return bits from process_mad Sean> indicate that the MAD was for the HCA driver? SUCCESS means that process_mad didn't encounter any errors. If REPLY or CONSUMED is set then process_mad actually handled the packet. - R. From roland at topspin.com Wed Nov 10 09:02:16 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 09:02:16 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52wtwt1vxx.fsf@topspin.com> (Roland Dreier's message of "Wed, 10 Nov 2004 08:59:54 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> Message-ID: <52sm7h1vtz.fsf@topspin.com> By the way, if I am reading the code correctly, it looks like the MAD layer only checks for IB_MAD_RESULT_REPLY and not IB_MAD_RESULT_CONSUMED. If IB_MAD_RESULT_CONSUMED is set then the packet is something like a trap repress handled by the SMA or a locally generated trap that the driver forwarded to the SM, so the packet should not go through agent dispatch. - R. From halr at voltaire.com Wed Nov 10 09:20:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 12:20:05 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52wtwt1vxx.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> Message-ID: <1100107204.2836.36.camel@hpc-1> On Wed, 2004-11-10 at 11:59, Roland Dreier wrote: > Sean> What exactly does it mean then when process_mad returns > Sean> success? Do any of the return bits from process_mad > Sean> indicate that the MAD was for the HCA driver? > > SUCCESS means that process_mad didn't encounter any errors. If REPLY > or CONSUMED is set then process_mad actually handled the packet. I would assume that REPLY and CONSUMED are also mutually exclusive. 
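Assuming they really are mutually exclusive, the receive-side handling I have in mind would look roughly like this (only a sketch against the process_mad call shape used in mad.c today, with illustrative variable names, not actual code):

/* Sketch: dispatch decision after process_mad, honoring both bits. */
ret = port_priv->device->process_mad(port_priv->device, 0,
                                     port_priv->port_num, slid,
                                     (struct ib_mad *)recv,
                                     (struct ib_mad *)response);
if (!(ret & IB_MAD_RESULT_SUCCESS)) {
        /* process_mad itself failed: drop the packet */
} else if (ret & IB_MAD_RESULT_CONSUMED) {
        /* e.g. trap repress handled by the SMA: no reply, no agent dispatch */
} else if (ret & IB_MAD_RESULT_REPLY) {
        /* driver filled in a response: send it, no agent dispatch */
} else {
        /* not handled by the driver: dispatch to a matching MAD agent */
}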
-- Hal From halr at voltaire.com Wed Nov 10 09:26:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 12:26:00 -0500 Subject: [openib-general] [PATCH] mad: In handle_outgoing_smp, validate process_mad routine exists prior to calling it Message-ID: <1100107560.2836.41.camel@hpc-1> mad: In handle_outgoing_smp, validate process_mad routine exists prior to calling it (issue pointed out by KK) Index: mad.c =================================================================== --- mad.c (revision 1189) +++ mad.c (working copy) @@ -405,30 +405,32 @@ goto error1; } - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - ret = mad_agent->device->process_mad( - mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ - (struct ib_mad *)smp, - (struct ib_mad *)&mad_priv->mad); - if ((ret & IB_MAD_RESULT_SUCCESS) && - (ret & IB_MAD_RESULT_REPLY)) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&mad_priv->mad, - mad_agent->device->node_type, - mad_agent->port_num, - mad_agent->device->phys_port_cnt)) { - ret = -EINVAL; - kmem_cache_free(ib_mad_cache, mad_priv); - goto error1; + if (mad_agent->device->process_mad) { + ret = mad_agent->device->process_mad( + mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? */ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)&mad_priv->mad, + mad_agent->device->node_type, + mad_agent->port_num, + mad_agent->device->phys_port_cnt)) { + ret = -EINVAL; + kmem_cache_free(ib_mad_cache, + mad_priv); + goto error1; + } } } /* See if response is solicited and there is a recv handler */ + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); if (solicited_mad(&mad_priv->mad.mad) && mad_agent_priv->agent.recv_handler) { struct ib_wc wc; From halr at voltaire.com Wed Nov 10 09:36:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 12:36:17 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <41916B15.8050909@ichips.intel.com> References: <41916B15.8050909@ichips.intel.com> Message-ID: <1100108177.2836.48.camel@hpc-1> On Tue, 2004-11-09 at 20:12, Sean Hefty wrote: > The following patch adds support for handling QP0/1 send queue overrun, > along with a couple of related fixes: > > * The patch includes the one provided by Roland to configure the > fabric. > * The code no longer modifies the user's send_wr structures when sending > a MAD. > * Sent MAD work requests are copied in order to handle both queuing and > error recovery (when added). > * The receive side code was slightly restructured to use a single > function to repost receives. If a receive cannot be posted for some > reason (e.g. lack of memory), it will now try to refill the receive > queue when posting an additional receive. (This will also make it > possible for the code to be lazier about reposting receives, which would > allow for better batching of completions.) I will break this up into two chunks: 1. the minor agent change 2. the rest (mad changes), excluding the already applied patch (to bring the ports up to ACTIVE), which I believe is temporary. > > Also, I switched my mailer, so I apologize in advance if I hose up my patch. It seems to have doubled up the inline diffs (as well as including it as an attachment), but there is no need to regenerate because of this.
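To make the queuing part concrete, the overrun handling amounts to a pattern like the following (my sketch with made-up names, not Sean's actual patch; the point is that the copied work request can safely sit on a list until the QP has room):

/* Sketch: post if the QP send queue has room, otherwise park the copy. */
spin_lock_irqsave(&qp_info->send_list_lock, flags);
if (qp_info->active_sends < qp_info->send_queue_depth) {
        ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, &bad_send_wr);
        if (!ret)
                qp_info->active_sends++;
} else {
        list_add_tail(&mad_send_wr->list, &qp_info->overflow_list);
        ret = 0;
}
spin_unlock_irqrestore(&qp_info->send_list_lock, flags);
/* In the send completion handler, post the next work request from
 * overflow_list (if any) before reporting the completion upward. */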
-- Hal From halr at voltaire.com Wed Nov 10 10:02:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 13:02:44 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <521xf13d1h.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> <1100102000.2836.6.camel@hpc-1> <521xf13d1h.fsf@topspin.com> Message-ID: <1100109764.2836.50.camel@hpc-1> On Wed, 2004-11-10 at 11:05, Roland Dreier wrote: > I think keeping the MAD code simpler is probably best right now. Hope that is for technical reasons and not for the recent missteps. -- Hal From halr at voltaire.com Wed Nov 10 10:04:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 13:04:41 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <1100108177.2836.48.camel@hpc-1> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> Message-ID: <1100109881.2836.52.camel@hpc-1> On Wed, 2004-11-10 at 12:36, Hal Rosenstock wrote: > I will break this up into two chunks: > 1. the minor agent change Thanks. Applied. -- Hal From roland at topspin.com Wed Nov 10 10:02:04 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 10:02:04 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100109764.2836.50.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 13:02:44 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> <1100102000.2836.6.camel@hpc-1> <521xf13d1h.fsf@topspin.com> <1100109764.2836.50.camel@hpc-1> Message-ID: <52bre51t2b.fsf@topspin.com> Roland> I think keeping the MAD code simpler is probably best right now. Hal> Hope that is for technical reasons and not for the recent missteps. Yes, it's just that the MAD code is quite complicated already with multiple tests for DR SMPs etc; mad.c alone is over 2000 lines now. I don't think you could even find a microbenchmark that could measure the improvement in testing the response bit in the MAD code rather than calling into process_mad for every packet, so I don't think we need to add more code to the MAD layer to do it. - R. From halr at voltaire.com Wed Nov 10 10:32:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 13:32:49 -0500 Subject: [openib-general] Solicited response with no matching send request Message-ID: <1100111569.2836.61.camel@hpc-1> Hi, I was just rerunning all of my test cases and have a question about the MAD layer receive processing: Currently if no matching send request is found, the received MAD is freed (around line 1035 of the current mad.c). In this case, timeout too short, etc., is this the correct behavior ? Or should the receive packet be given to a matching MAD agent with a receive handler (perhaps with a different status) ? 
The latter would allow for an additional send model for requests, which I don't think is supported now, at the cost of having the client throw away these receives based on a new status code (perhaps some sort of timeout). Just wondering... Thanks. -- Hal From mshefty at ichips.intel.com Wed Nov 10 10:43:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 10:43:56 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100111569.2836.61.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> Message-ID: <4192616C.7070905@ichips.intel.com> Hal Rosenstock wrote: > Currently if no matching send request is found, the received MAD is > freed (around line 1035 of the current mad.c). > > In this case, timeout too short, etc., is this the correct behavior ? > Or should the receive packet be given to a matching MAD agent with a > receive handler (perhaps with a different status) ? The latter would > allow for an additional send model for requests, which I don't think is > supported now, at the cost of having the client throw away these receives > based on a new status code (perhaps some sort of timeout). I think that this is the behavior that you'd want, but I can see your view, and I'm open to changing it. From a client's perspective, dropping an unmatched MAD keeps the client from having to handle receive MADs without having a send outstanding. That is, I would think that a client that could make use of this MAD would have to be fairly complex. I see a couple of cases where this would happen. The first is the one you mention, where the timeout was too short. If the client retries the request, then they would need to deal with an unmatched response coming in before they issued the retry, while the retry is active (where the retry is sent after the receive had been checked for a match), or after the retry completed (with the need to handle multiple unmatched responses). The second case where I can see this happening is if the client canceled the send, and I'm not sure that we'd want to give the client an unmatched response in this case.
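To spell out what "unmatched" means in code terms, the receive side today does roughly the following (a sketch, not the actual mad.c; find_send_by_tid is a made-up name for the TID lookup):

/* Sketch: solicited receive matching against outstanding sends. */
if (solicited_mad(&recv->mad.mad)) {
        spin_lock_irqsave(&mad_agent_priv->lock, flags);
        mad_send_wr = find_send_by_tid(mad_agent_priv,
                                       recv->mad.mad.mad_hdr.tid);
        spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
        if (!mad_send_wr) {
                /* no matching send request: the response is simply freed */
                kmem_cache_free(ib_mad_cache, recv);
                return;
        }
        /* otherwise the receive goes to the agent's recv_handler */
}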
- Sean From halr at voltaire.com Wed Nov 10 11:07:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 14:07:32 -0500 Subject: [openib-general] [PATCH] agent: Handle out of order send completions Message-ID: <1100113652.2836.72.camel@hpc-1> agent: Handle out of order send completions (Issue pointed out by Sean) Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1183) +++ agent_priv.h (working copy) @@ -46,7 +46,6 @@ struct ib_mad_agent *lr_smp_agent; /* LR SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ struct ib_mr *mr; - u64 wr_id; }; #endif /* __IB_AGENT_PRIV_H__ */ Index: agent.c =================================================================== --- agent.c (revision 1192) +++ agent.c (working copy) @@ -117,9 +117,9 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, &mad->mad, - sizeof mad->mad, + sizeof(mad->mad), PCI_DMA_TODEVICE); - gather_list.length = sizeof mad->mad; + gather_list.length = sizeof(mad->mad); gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -172,7 +172,7 @@ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; - send_wr.wr_id = ++port_priv->wr_id; + send_wr.wr_id = (unsigned long)&agent_send_wr->send_list; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -182,7 +182,7 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof mad->mad, + sizeof(mad->mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -247,31 +247,18 @@ return; } - /* Completion corresponds to first entry on posted MAD send list */ spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (list_empty(&port_priv->send_posted_list)) { - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - printk(KERN_ERR SPFX "Send completion WR ID 0x%Lx but send " - "list is empty\n", - (unsigned long long) mad_send_wc->wr_id); - return; - } - - agent_send_wr = list_entry(&port_priv->send_posted_list, - struct ib_agent_send_wr, - send_list); - send_wr = agent_send_wr->send_list.next; - agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, + send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; + agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, send_list); - - /* Remove from posted send MAD list */ + /* Remove completed send from posted send MAD list */ list_del(&agent_send_wr->send_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof agent_send_wr->mad->mad, + sizeof(agent_send_wr->mad->mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); @@ -306,7 +293,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->port_num = port_num; - port_priv->wr_id = 0; spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->send_posted_list); From mshefty at ichips.intel.com Wed Nov 10 11:07:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 11:07:00 -0800 Subject: [openib-general] [PATCH] agent: Handle out of order send completions In-Reply-To: <1100113652.2836.72.camel@hpc-1> References: <1100113652.2836.72.camel@hpc-1> Message-ID: <419266D4.6040005@ichips.intel.com> Hal Rosenstock wrote: > - send_wr.wr_id = ++port_priv->wr_id; > + send_wr.wr_id = (unsigned 
long)&agent_send_wr->send_list; {snip} > + send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; > + agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, > send_list); I think it may be clearer to set the wr_id to agent_send_wr, rather than a subfield. Thanks for doing this btw; I can take it off my to-do list. :) - Sean From halr at voltaire.com Wed Nov 10 12:43:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 15:43:03 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <1100108177.2836.48.camel@hpc-1> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> Message-ID: <1100119383.2836.81.camel@hpc-1> On Wed, 2004-11-10 at 12:36, Hal Rosenstock wrote: > I will break this up into two chunks: > 2. the rest (mad changes), excluding the already applied patch (to bring > the ports up to ACTIVE), which I believe is temporary. A few minor questions (before applying this): 1. Why was BUG_ON removed from dequeue_mad ? 2. A couple of questions related to send_wr->num_sge checking. a. Should this be pushed down to mthca and detected there rather than at the MAD layer ? b. If it is to stay at the MAD layer, shouldn't there be a check inside the while (send_wr) loop rather than above it ? -- Hal From halr at voltaire.com Wed Nov 10 12:53:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 15:53:26 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <4192616C.7070905@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> Message-ID: <1100120005.2836.86.camel@hpc-1> On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Currently if no matching send request is found, the received MAD is > > freed (around line 1035 of the current mad.c). > > > > In this case, timeout too short, etc., is this the correct behavior ? > > Or should the receive packet be given to a matching MAD agent with a > > receive handler (perhaps with a different status) ? The latter would > > allow for an additional send model for requests, which I don't think is > > supported now, at the cost of having the client throw away these receives > > based on a new status code (perhaps some sort of timeout). > > I think that this is the behavior that you'd want, but I can see your > view, and I'm open to changing it. From a client's perspective, > dropping an unmatched MAD keeps the client from having to handle receive > MADs without having a send outstanding. That is, I would think that a > client that could make use of this MAD would have to be fairly complex. I don't know whether the SM or other managers would use this model, so it's just a thought to keep in mind for the future. > I see a couple of cases where this would happen. The first is the one > you mention, where the timeout was too short. If the client retries the > request, then they would need to deal with an unmatched response coming > in before they issued the retry, while the retry is active (where the > retry is sent after the receive had been checked for a match), or after the > retry completed (with the need to handle multiple unmatched responses). > > The second case where I can see this happening is if the client canceled > the send, and I'm not sure that we'd want to give the client an > unmatched response in this case.
So we would also need to time out send MAD cancellations (rather than eliminating them immediately) so we wouldn't give a receive back in that case. -- Hal From halr at voltaire.com Wed Nov 10 12:57:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 15:57:26 -0500 Subject: [openib-general] MAD agent code comments In-Reply-To: <4190F7C6.8020509@ichips.intel.com> References: <41900EEB.7050109@ichips.intel.com> <1100009558.13933.3.camel@localhost.localdomain> <4190F7C6.8020509@ichips.intel.com> Message-ID: <1100120246.2836.90.camel@hpc-1> Hi Sean, On Tue, 2004-11-09 at 12:00, Sean Hefty wrote: > Hal Rosenstock wrote: > > Since the agent does not use solicited sends, are its sends completed in > > order (so this is only an issue for clients using solicited sends) ? > > I would think that solicited sends (i.e. responses) would be easier to > maintain order, since those wouldn't have a timeout. We are using solicited slightly differently. I am using it for sending a request which has a timeout and is expected to elicit a response. > But my preference > would be to not define the API this way. It makes queuing for QP > overrun and error handling difficult. > > For example, a client posts 2 sends, both of which get queued. If the > first send gets posted, but the second send fails when posting to the > QP, then we'd need to delay reporting the second send's completion. > This also makes it more difficult to go to multi-threaded completion > handling, if that were shown to be beneficial. I posted a patch for this which you have seen. -- Hal From halr at voltaire.com Wed Nov 10 13:19:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 16:19:29 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52r7n22tgt.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> Message-ID: <1100121569.2836.112.camel@hpc-1> I haven't cleared the other issues before getting back to this, but wanted to respond to some of the points below: On Tue, 2004-11-09 at 23:55, Roland Dreier wrote: > Roland> OK, I think I understand the problem, but I'm not sure > Roland> what the correct solution is. When a DR SMP arrives at a > Roland> CA from the SM, hop_cnt == hop_ptr == number of hops in > Roland> the directed route, > > Hal> What was the number ? > > For one port it was 4 and for another it was 6. It could really be > anything (it's just how many hops away the SM is). I think I understand how DR is supposed to work :-) I was just looking for the actual values in the failed case to try to understand what the code was doing, as I don't have a configuration to recreate this (at least yet). From what you indicated, it looks like it would be the following case, so no response would be sent: /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { if (node_type != IB_NODE_SWITCH) return 0; but I'm not sure whether those were the values on entry to the smi_dr_handle_smp_recv routine that was excised from the code. > Hal> I integrated this patch and checked it back in. I don't think > Hal> this is the solution for all cases (and something else is > Hal> broken). > > Could be.
I had a hard time checking the code in smi.c (which is > split between smi_handle_dr_smp_recv() and smi_handle_dr_smp_send() as > well as smi_check_forward_dr_smp(), but which has outgoing and > returning DR handling mixed together) against the IB spec (which > splits outgoing and returning DR handling). I had to squint hard the first time I went through this too (and probably will again). I will explain how this works in sufficient detail if this is of interest. > Hal> The second call to smi_handle_dr_smp_recv was to validate the > Hal> DR in the response packet before sending it. The response > Hal> would be a returning DR packet (D bit 1). If hop_cnt == > Hal> hop_ptr, > > I guess the problem with calling smi_handle_dr_smp_recv() twice on the > same packet is that the function may alter the packet. No, the second call to smi_handle_dr_smp_recv() was on the outgoing response and not the incoming request. The thought was that a packet coming from process_mad is much like an incoming received packet and hence the call to smi_handle_dr_smp_recv. The routine validates the packet but also can do some fixups depending on which case it falls into. Guess it's only dangerous to validate this and wrong to fix it up. The key to me is the following: The split of responsibility on the DR header formation is a little unclear to me. In the case of the SM, are the DR headers fully formed before handing it to the MAD layer or is some DR fixup needed ? -- Hal From roland at topspin.com Wed Nov 10 13:29:40 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 13:29:40 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100121569.2836.112.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 16:19:29 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <1100121569.2836.112.camel@hpc-1> Message-ID: <52pt2lz92z.fsf@topspin.com> Roland> I guess the problem with calling smi_handle_dr_smp_recv() Roland> twice on the same packet is that the function may alter Roland> the packet. Hal> No, the second call to smi_handle_dr_smp_recv() was on the Hal> outgoing response and not the incoming request. The thought Hal> was that a packet coming from process_mad is much like an Hal> incoming received packet and hence the call to Hal> smi_handle_dr_smp_recv. The routine validates the packet but Hal> also can do some fixups depending on which case it falls Hal> into. Guess it's only dangerous to validate this and wrong to Hal> fix it up. Maybe I'm misreading the code, but my patch deleted the call to smi_handle_dr_smp_recv() before the call to agent_send. agent_send() eventually ends up in ib_post_send_mad(), which calls handle_outgoing_smp() for directed route MADs, which ends up calling smi_handle_dr_smp_recv() again. Since smi_handle_dr_smp_recv() can change the packet, calling it twice on the same packet seems to break things. However I don't think it's a good idea to think of responses generated by process_mad as an incoming received packet. I think they should be thought of as returning DR SMPs being passed to the SMI for sending (as in section 14.2.2 of the IB spec). Hal> The key to me is the following: The split of responsibility Hal> on the DR header formation is a little unclear to me. 
In the Hal> case of the SM, are the DR headers fully formed before Hal> handing it to the MAD layer or is some DR fixup needed ? My suggestion would be to follow the IB spec, and assume that the SM follows the SMP initialization in 14.2.2.1 and have the MAD layer just implement the SMI processing in 14.2.2.2. (And I believe things should work similarly for responses generated by the SMA -- the MAD layer should just do SMI processing). - R. From mshefty at ichips.intel.com Wed Nov 10 13:30:13 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 13:30:13 -0800 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <1100119383.2836.81.camel@hpc-1> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> <1100119383.2836.81.camel@hpc-1> Message-ID: <41928865.3020003@ichips.intel.com> Hal Rosenstock wrote: > 1. Why was BUG_ON removed from dequeue_mad ? That can be put back. I removed queue_mad, and was going to remove dequeue_mad, but decided to leave it. > 2. A couple of questions related to send_wr->num_sge checking. > a. Should this be pushed down to mthca and detected there rather than at > the MAD layer ? > b. If it is to stay at the MAD layer, shouldn't there be a check inside > the while (send_wr) loop rather than above it ? I put this check in the MAD layer, since it may be more restrictive than what mthca provides. Looking at that part of the code, we can push the check to mthca by making the following changes: Move sg_list[] in ib_mad_send_wr_private to the end of the structure. Change the sg_list array size from IB_MAD_SEND_REQ_MAX_SG to 1. Change the kmalloc in ib_post_send_mad() to use sizeof *mad_send_wr + sizeof *mad_send_wr->sg_list * (send_wr->num_sge - 1) You are correct that the check needs to be within the while loop if it remains in the MAD code. - Sean From paul.baxter at dsl.pipex.com Wed Nov 10 13:49:23 2004 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Wed, 10 Nov 2004 21:49:23 -0000 Subject: [openib-general] News: Roland, Hal, Sean et al might actually get paid! Message-ID: <008e01c4c76f$24c5da20$8000000a@blorp> Glad to see http://news.zdnet.com/2100-9593_22-5446887.html One snippet from the article '..the grant will fund 8-10 full-time programmers.' Does this equate to Sean, Roland, Hal working 80 hour weeks with some support from others merely working 40 hour weeks :) Just wanted to say well done for the work to date but you guys are allowed to take the weekend off occasionally. Lets hope this opportunity to sell the positive contribution to Infiniband and Linux gets heard. I know Roland wrote a partial rebuttal to Greg KH's LWN article, but I can't help feeling part of getting adoption of IB in Linux is the PR battle. From halr at voltaire.com Wed Nov 10 14:20:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:20:39 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <41928865.3020003@ichips.intel.com> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> <1100119383.2836.81.camel@hpc-1> <41928865.3020003@ichips.intel.com> Message-ID: <1100125238.2836.125.camel@hpc-1> On Wed, 2004-11-10 at 16:30, Sean Hefty wrote: > Hal Rosenstock wrote: > > 1. Why was BUG_ON removed from dequeue_mad ? > > That can be put back. I removed queue_mad, and was going to remove > dequeue_mad, but decided to leave it. I added this back in. > > 2. A couple of questions related to send_wr->num_sge checking. > > a. 
Should this be pushed down to mthca and detected there rather than at > > the MAD layer ? > > b. If it is to stay at the MAD layer, shouldn't there be a check inside > > the while (send_wr) loop rather than above it ? > > I put this check in the MAD layer, since it may be more restrictive than > what mthca provides. Looking at that part of the code, we can push the > check to mthca by making the following changes: > > Move sg_list[] in ib_mad_send_wr_private to the end of the structure. > Change the sg_list array size from IB_MAD_SEND_REQ_MAX_SG to 1. > Change the kmalloc in ib_post_send_mad() to use sizeof *mad_send_wr + > sizeof *mad_send_wr->sg_list * (send_wr->num_sge - 1) > > You are correct that the check needs to be within the while loop if it > remains in the MAD code. I made the above changes by hand (moving the check down for at least the time being). Thanks! Applied. (Nice work). -- Hal From mshefty at ichips.intel.com Wed Nov 10 14:22:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 14:22:55 -0800 Subject: [openib-general] [PATCH] [TRIVIAL] remove unneeded locking in ib_mad_return_posted_recv_mads Message-ID: <419294BF.50207@ichips.intel.com> Removed locking, since this is in cleanup code. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1197) +++ core/mad.c (working copy) @@ -1602,7 +1602,6 @@ struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; - spin_lock_irqsave(&qp_info->recv_queue.lock, flags); while (!list_empty(&qp_info->recv_queue.list)) { mad_list = list_entry(qp_info->recv_queue.list.next, @@ -1626,7 +1625,6 @@ } qp_info->recv_queue.count = 0; - spin_unlock_irqrestore(&qp_info->recv_queue.lock, flags); } /* From halr at voltaire.com Wed Nov 10 14:30:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:30:19 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52pt2lz92z.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <1100121569.2836.112.camel@hpc-1> <52pt2lz92z.fsf@topspin.com> Message-ID: <1100125819.2836.136.camel@hpc-1> On Wed, 2004-11-10 at 16:29, Roland Dreier wrote: > Roland> I guess the problem with calling smi_handle_dr_smp_recv() > Roland> twice on the same packet is that the function may alter > Roland> the packet. > > Hal> No, the second call to smi_handle_dr_smp_recv() was on the > Hal> outgoing response and not the incoming request. The thought > Hal> was that a packet coming from process_mad is much like an > Hal> incoming received packet and hence the call to > Hal> smi_handle_dr_smp_recv. The routine validates the packet but > Hal> also can do some fixups depending on which case it falls > Hal> into. Guess it's only dangerous to validate this and wrong to > Hal> fix it up. > > Maybe I'm misreading the code, but my patch deleted the call to > smi_handle_dr_smp_recv() before the call to agent_send. You're not. I was... > agent_send() eventually ends up in ib_post_send_mad(), which calls > handle_outgoing_smp() for directed route MADs, which ends up calling > smi_handle_dr_smp_recv() again. Since smi_handle_dr_smp_recv() can > change the packet, calling it twice on the same packet seems to break things. I'm with you now. 
> However I don't think it's a good idea to think of responses generated > by process_mad as an incoming received packet. I think they should be > thought of as returning DR SMPs being passed to the SMI for sending > (as in section 14.2.2 of the IB spec). Yup, there is a difference between a returning SMP being sent and an incoming SMP being received in terms of SMI. I was being imprecise again. > Hal> The key to me is the following: The split of responsibility > Hal> on the DR header formation is a little unclear to me. In the > Hal> case of the SM, are the DR headers fully formed before > Hal> handing it to the MAD layer or is some DR fixup needed ? > > My suggestion would be to follow the IB spec, and assume that the SM > follows the SMP initialization in 14.2.2.1 and have the MAD layer just > implement the SMI processing in 14.2.2.2. (And I believe things > should work similarly for responses generated by the SMA -- the MAD > layer should just do SMI processing). That was the intention. I will figure out what is broke but not just yet... I may want something tried by either you or Sean prior to my checking it in to be sure. I'll let you know. Thanks. -- Hal From mshefty at ichips.intel.com Wed Nov 10 14:33:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 14:33:44 -0800 Subject: [openib-general] [PATCH] adjust error checking in ib_post_send_mad Message-ID: <41929748.1030204@ichips.intel.com> Removes unneeded check and relocates other to while loop. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1197) +++ core/mad.c (working copy) @@ -518,14 +518,10 @@ if (!bad_send_wr) goto error1; - if (!mad_agent || !send_wr ) + if (!mad_agent || !send_wr) goto error2; - if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) - goto error2; - - if (!mad_agent->send_handler || - (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) + if (!mad_agent->send_handler) goto error2; mad_agent_priv = container_of(mad_agent, @@ -543,6 +539,9 @@ if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) goto error2; + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + if (!send_wr->wr.ud.mad_hdr) { printk(KERN_ERR PFX "MAD header must be supplied " "in WR %p\n", send_wr); From halr at voltaire.com Wed Nov 10 14:43:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:43:08 -0500 Subject: [openib-general] Re: [PATCH] [TRIVIAL] remove unneeded locking in ib_mad_return_posted_recv_mads In-Reply-To: <419294BF.50207@ichips.intel.com> References: <419294BF.50207@ichips.intel.com> Message-ID: <1100126588.2836.138.camel@hpc-1> On Wed, 2004-11-10 at 17:22, Sean Hefty wrote: > Removed locking, since this is in cleanup code. Thanks. Applied. -- Hal From halr at voltaire.com Wed Nov 10 14:54:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:54:02 -0500 Subject: [openib-general] Re: [PATCH] adjust error checking in ib_post_send_mad In-Reply-To: <41929748.1030204@ichips.intel.com> References: <41929748.1030204@ichips.intel.com> Message-ID: <1100127242.2836.140.camel@hpc-1> On Wed, 2004-11-10 at 17:33, Sean Hefty wrote: > Removes unneeded check and relocates other to while loop. Thanks. Applied. 
-- Hal From halr at voltaire.com Wed Nov 10 16:19:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 19:19:30 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52sm7h1vtz.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> <52sm7h1vtz.fsf@topspin.com> Message-ID: <1100132370.3283.30.camel@localhost.localdomain> On Wed, 2004-11-10 at 12:02, Roland Dreier wrote: > By the way, if I am reading the code correctly, it looks like the MAD > layer only checks for IB_MAD_RESULT_REPLY and not > IB_MAD_RESULT_CONSUMED. You are reading the code correctly. > If IB_MAD_RESULT_CONSUMED is set then the > packet is something like a trap repress handled by the SMA or a > locally generated trap that the driver forwarded to the SM, so the > packet should not go through agent dispatch. This is a patch which should occur shortly. -- Hal From halr at voltaire.com Wed Nov 10 17:34:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 20:34:18 -0500 Subject: [openib-general] [PATCH] mad: After calling process_mad, handle MAD being consumed Message-ID: <1100136858.2739.3.camel@hpc-1> mad: After calling process_mad, handle MAD being consumed Index: mad.c =================================================================== --- mad.c (revision 1199) +++ mad.c (working copy) @@ -400,16 +400,22 @@ smp->dr_slid, /* ? */ (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - if ((ret & IB_MAD_RESULT_SUCCESS) && - (ret & IB_MAD_RESULT_REPLY)) { - if (!smi_handle_dr_smp_recv( + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) { + ret = 1; + goto error1; + } + if (ret & IB_MAD_RESULT_REPLY) { + if (!smi_handle_dr_smp_recv( (struct ib_smp *)&mad_priv->mad, mad_agent->device->node_type, mad_agent->port_num, mad_agent->device->phys_port_cnt)) { - ret = -EINVAL; - kmem_cache_free(ib_mad_cache, mad_priv); - goto error1; + ret = -EINVAL; + kmem_cache_free(ib_mad_cache, + mad_priv); + goto error1; + } } } } @@ -1147,6 +1153,8 @@ recv->header.recv_buf.mad, &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; if (ret & IB_MAD_RESULT_REPLY) { /* Send response */ if (!agent_send(response, &recv->grh, wc, From halr at voltaire.com Thu Nov 11 04:45:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 07:45:54 -0500 Subject: [openib-general] [PATCH] agent: Handle out of order send completions In-Reply-To: <419266D4.6040005@ichips.intel.com> References: <1100113652.2836.72.camel@hpc-1> <419266D4.6040005@ichips.intel.com> Message-ID: <1100177153.3283.68.camel@localhost.localdomain> On Wed, 2004-11-10 at 14:07, Sean Hefty wrote: > Hal Rosenstock wrote: > > > - send_wr.wr_id = ++port_priv->wr_id; > > + send_wr.wr_id = (unsigned long)&agent_send_wr->send_list; > {snip} > > + send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; > > + agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, > > send_list); > > I think it may be clearer to set the wr_id to agent_send_wr, rather than > a subfield. Yes, that would be better (clearer and less code). Patch shortly for this. 
-- Hal From halr at voltaire.com Thu Nov 11 05:35:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 08:35:31 -0500 Subject: [openib-general] [PATCH] agent: Better wr_id in send WR makes for slightly simpler completion handling Message-ID: <1100180130.9470.38.camel@hpc-1> agent: Better wr_id in send WR makes for slightly simpler completion handling (comment from Sean) Index: agent.c =================================================================== --- agent.c (revision 1200) +++ agent.c (working copy) @@ -172,7 +172,7 @@ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; - send_wr.wr_id = (unsigned long)&agent_send_wr->send_list; + send_wr.wr_id = (unsigned long)agent_send_wr; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -236,7 +236,6 @@ { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; - struct list_head *send_wr; unsigned long flags; /* Find matching MAD agent */ @@ -247,10 +246,8 @@ return; } + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; spin_lock_irqsave(&port_priv->send_list_lock, flags); - send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; - agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, - send_list); /* Remove completed send from posted send MAD list */ list_del(&agent_send_wr->send_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 06:01:00 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 06:01:00 -0800 Subject: [openib-general] New OpenIB webpages Message-ID: <1100181660.14334.548.camel@trinity> As some of you may have noticed, we migrated over to the new OpenIB web pages yesterday. The FAQ and a few other items are still a work in progress. Let me know if there are any errors or if folks have other feedback/suggestions. Thanks, - Matt From tduffy at sun.com Thu Nov 11 07:06:53 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 07:06:53 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100181660.14334.548.camel@trinity> References: <1100181660.14334.548.camel@trinity> Message-ID: <1100185613.22128.5.camel@duffman> On Thu, 2004-11-11 at 06:01 -0800, Matt Leininger wrote: > > As some of you may have noticed, we migrated over to the new OpenIB > web pages yesterday. The FAQ and a few other items are still a work in > progress. Let me know if there are any errors or if folks have other > feedback/suggestions. Well done. The new page looks great. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 07:41:44 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 07:41:44 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100181660.14334.548.camel@trinity> (Matt Leininger's message of "Thu, 11 Nov 2004 06:01:00 -0800") References: <1100181660.14334.548.camel@trinity> Message-ID: <52ekj0z93b.fsf@topspin.com> Matt> As some of you may have noticed, we migrated over to the Matt> new OpenIB web pages yesterday. The FAQ and a few other Matt> items are still a work in progress. Let me know if there Matt> are any errors or if folks have other feedback/suggestions. Looks great. 
One suggestions: under news, it's probably worth linking to or mentioning the PathForward funding announcement. - R. From tduffy at sun.com Thu Nov 11 08:14:21 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 08:14:21 -0800 Subject: [openib-general] openib.org/bugzilla Message-ID: <1100189661.25996.2.camel@duffman> I just signed up for an account, but the email confirmation had the wrong address. It said to go to: http://cvs-mirror.mozilla.org/webtools/bugzilla/userprefs.cgi Also, it seems there is no gen2 version in the query field. Thanks, -tduffy -- Tom Duffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 08:31:57 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 08:31:57 -0800 Subject: [openib-general] [PATCH] ipoib: Free AHs Message-ID: <52actoz6rm.fsf@topspin.com> This patch corrects the fact that IPoIB leaks all of its address handles by creating a list of dead AHs and freeing an AH once all the sends using it complete. Index: ulp/ipoib/ipoib_verbs.c =================================================================== --- ulp/ipoib/ipoib_verbs.c (revision 1201) +++ ulp/ipoib/ipoib_verbs.c (working copy) @@ -171,16 +171,6 @@ return -EINVAL; } -void ipoib_qp_destroy(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - if (ib_destroy_qp(priv->qp)) - ipoib_warn(priv, "ib_qp_destroy failed\n"); - - priv->qp = NULL; -} - int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) { struct ipoib_dev_priv *priv = netdev_priv(dev); Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 1201) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -177,7 +177,7 @@ struct ipoib_path *path = path_ptr; struct ipoib_dev_priv *priv = netdev_priv(path->dev); struct sk_buff *skb; - struct ib_ah *ah; + struct ipoib_ah *ah; ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); @@ -195,10 +195,10 @@ .port_num = priv->port }; - ah = ib_create_ah(priv->pd, &av); + ah = ipoib_create_ah(path->dev, priv->pd, &av); } - if (IS_ERR(ah)) + if (!ah) goto err; path->ah = ah; @@ -299,7 +299,7 @@ { struct sk_buff *skb = skb_ptr; struct ipoib_dev_priv *priv = netdev_priv(skb->dev); - struct ib_ah *ah; + struct ipoib_ah *ah; ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); @@ -307,6 +307,10 @@ if (status) goto err; + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + goto err; + { struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), @@ -317,13 +321,15 @@ .port_num = priv->port }; - ah = ib_create_ah(priv->pd, &av); + ah->ah = ib_create_ah(priv->pd, &av); } - if (IS_ERR(ah)) + if (IS_ERR(ah->ah)) { + kfree(ah); goto err; + } - *(struct ib_ah **) skb->cb = ah; + *(struct ipoib_ah **) skb->cb = ah; if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed " @@ -337,10 +343,15 @@ static void unicast_arp_finish(struct sk_buff *skb) { - struct ib_ah *ah = *(struct ib_ah **) skb->cb; + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + struct ipoib_ah *ah = *(struct ipoib_ah **) skb->cb; + unsigned long flags; - if (ah) - ib_destroy_ah(ah); + if (ah) { + spin_lock_irqsave(&priv->lock, flags); + 
list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } } /* @@ -443,7 +454,7 @@ * now we can just send the packet. */ if (skb->destructor == unicast_arp_finish) { - ipoib_send(dev, skb, *(struct ib_ah **) skb->cb, + ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, be32_to_cpup((u32 *) phdr->hwaddr)); return 0; } @@ -454,14 +465,7 @@ skb->dst ? "neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), - phdr->hwaddr[ 4], phdr->hwaddr[ 5], - phdr->hwaddr[ 6], phdr->hwaddr[ 7], - phdr->hwaddr[ 8], phdr->hwaddr[ 9], - phdr->hwaddr[10], phdr->hwaddr[11], - phdr->hwaddr[12], phdr->hwaddr[13], - phdr->hwaddr[14], phdr->hwaddr[15], - phdr->hwaddr[16], phdr->hwaddr[17], - phdr->hwaddr[18], phdr->hwaddr[19]); + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); /* put the pseudoheader back on */ skb_push(skb, sizeof *phdr); @@ -529,10 +533,17 @@ static void ipoib_neigh_destructor(struct neighbour *neigh) { - ipoib_dbg(netdev_priv(neigh->dev), - "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + struct ipoib_dev_priv *priv = netdev_priv(neigh->dev); + struct ipoib_path *path = IPOIB_PATH(neigh); + + ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", be32_to_cpup((__be32 *) neigh->ha), IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); + + if (path && path->ah) { + ipoib_put_ah(path->ah); + kfree(path); + } } static int ipoib_neigh_setup(struct neighbour *neigh) @@ -683,12 +694,14 @@ sema_init(&priv->mcast_mutex, 1); INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); } struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 1201) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -36,7 +36,7 @@ /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; - struct ib_ah *address_handle; + struct ipoib_ah *ah; struct rb_node rb_node; struct list_head list; @@ -69,11 +69,8 @@ ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); - if (mcast->address_handle != NULL) { - int ret = ib_destroy_ah(mcast->address_handle); - if (ret < 0) - ipoib_warn(priv, "ib_destroy_ah failed (ret = %d)\n", ret); - } + if (mcast->ah) + ipoib_put_ah(mcast->ah); while (!skb_queue_empty(&mcast->pkt_queue)) { struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); @@ -108,7 +105,7 @@ INIT_LIST_HEAD(&mcast->list); skb_queue_head_init(&mcast->pkt_queue); - mcast->address_handle = NULL; + mcast->ah = NULL; mcast->query = NULL; return mcast; @@ -224,14 +221,14 @@ av.grh.dgid = mcast->mcmember.mgid; - mcast->address_handle = ib_create_ah(priv->pd, &av); - if (IS_ERR(mcast->address_handle)) { + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { ipoib_warn(priv, "ib_address_create failed\n"); } else { ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT " AV %p, LID 0x%04x, SL %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), - mcast->address_handle, + mcast->ah->ah, be16_to_cpu(mcast->mcmember.mlid), mcast->mcmember.sl); 
} @@ -661,7 +658,7 @@ list_add_tail(&mcast->list, &priv->multicast_list); } - if (!mcast->address_handle) { + if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); else @@ -682,14 +679,15 @@ out: spin_unlock_irqrestore(&priv->lock, flags); - if (mcast && mcast->address_handle) { + if (mcast && mcast->ah) { if (skb->dst && skb->dst->neighbour && !IPOIB_PATH(skb->dst->neighbour)) { struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); if (path) { - path->ah = mcast->address_handle; + kref_get(&mcast->ah->ref); + path->ah = mcast->ah; path->qpn = IB_MULTICAST_QPN; path->dev = dev; path->neighbour = skb->dst->neighbour; @@ -697,7 +695,7 @@ } } - ipoib_send(dev, skb, mcast->address_handle, IB_MULTICAST_QPN); + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } } @@ -951,9 +949,9 @@ mcast = rb_entry(iter->rb_node, struct ipoib_mcast, rb_node); - *mgid = mcast->mcmember.mgid; - *created = mcast->created; - *queuelen = skb_queue_len(&mcast->pkt_queue); - *complete = mcast->address_handle != NULL; + *mgid = mcast->mcmember.mgid; + *created = mcast->created; + *queuelen = skb_queue_len(&mcast->pkt_queue); + *complete = !!mcast->ah; *send_only = (mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)) ? 1 : 0; } Index: ulp/ipoib/ipoib.h =================================================================== --- ulp/ipoib/ipoib.h (revision 1201) +++ ulp/ipoib/ipoib.h (working copy) @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -65,6 +66,7 @@ IPOIB_PKEY_STOP = 4, IPOIB_FLAG_SUBINTERFACE = 5, IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -109,6 +111,7 @@ struct work_struct mcast_task; struct work_struct flush_task; struct work_struct restart_task; + struct work_struct ah_reap_task; struct ib_device *ca; u8 port; @@ -134,18 +137,28 @@ struct ib_wc ibwc[IPOIB_NUM_WC]; + struct list_head dead_ahs; + struct proc_dir_entry *mcast_proc_entry; struct ib_event_handler event_handler; struct net_device_stats stats; + struct list_head child_intfs; struct list_head list; - struct list_head child_intfs; }; +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + struct ipoib_path { - struct ib_ah *ah; + struct ipoib_ah *ah; u32 qpn; struct sk_buff_head queue; @@ -166,8 +179,17 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ib_ah *address, u32 qpn); + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); @@ -213,7 +235,6 @@ union ib_gid *mgid); int ipoib_qp_create(struct net_device *dev); -void ipoib_qp_destroy(struct net_device *dev); int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); void ipoib_transport_dev_cleanup(struct net_device *dev); Index: ulp/ipoib/ipoib_ib.c =================================================================== --- ulp/ipoib/ipoib_ib.c (revision 1201) +++ ulp/ipoib/ipoib_ib.c (working copy) @@ -29,10 +29,44 @@ static DECLARE_MUTEX(pkey_sem); -static int _ipoib_ib_receive(struct ipoib_dev_priv *priv, - u64 work_request_id, - dma_addr_t addr) +struct ipoib_ah 
*ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) { + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_ib_receive(struct ipoib_dev_priv *priv, + u64 work_request_id, + dma_addr_t addr) +{ struct ib_sge list = { .addr = addr, .length = IPOIB_BUF_SIZE, @@ -50,8 +84,8 @@ } /* =============================================================== */ -/*.._ipoib_ib_post_receive -- post a receive buffer */ -static int _ipoib_ib_post_receive(struct net_device *dev, int id) +/*..ipoib_ib_post_receive -- post a receive buffer */ +static int ipoib_ib_post_receive(struct net_device *dev, int id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; @@ -72,24 +106,24 @@ PCI_DMA_FROMDEVICE); pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); - ret = _ipoib_ib_receive(priv, id, addr); + ret = ipoib_ib_receive(priv, id, addr); if (ret) - ipoib_warn(priv, "_ipoib_ib_receive failed for buf %d (%d)\n", + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", id, ret); return ret; } /* =============================================================== */ -/*.._ipoib_ib_post_receives -- post all receive buffers */ -static int _ipoib_ib_post_receives(struct net_device *dev) +/*..ipoib_ib_post_receives -- post all receive buffers */ +static int ipoib_ib_post_receives(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int i; for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { - if (_ipoib_ib_post_receive(dev, i)) { - ipoib_warn(priv, "_ipoib_ib_post_receive failed for buf %d\n", i); + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); return -EIO; } } @@ -108,7 +142,7 @@ if (entry->status != IB_WC_SUCCESS) { ipoib_warn(priv, "got failed completion event " - "(status=%d, wrid=%d, op=%d)", + "(status=%d, wrid=%d, op=%d)\n", entry->status, wr_id, entry->opcode); if (entry->opcode == IB_WC_SEND) { @@ -163,8 +197,8 @@ } /* repost receive */ - if (_ipoib_ib_post_receive(dev, wr_id)) - ipoib_warn(priv, "_ipoib_ib_post_receive failed " + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " "for buf %d\n", wr_id); } else ipoib_warn(priv, "completion event with wrid %d\n", @@ -262,7 +296,7 @@ /* =============================================================== */ /*..ipoib_send -- schedule an IB send work request */ void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ib_ah *address, u32 qpn) + struct ipoib_ah *address, u32 qpn) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_buf *tx_req; @@ -302,7 +336,7 @@ pci_unmap_addr_set(tx_req, mapping, addr); if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), - address, qpn, addr, skb->len)) { + address->ah, qpn, addr, skb->len)) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; tx_req->skb = NULL; @@ -312,6 +346,7 @@ dev->trans_start = jiffies; + address->last_send = priv->tx_head; ++priv->tx_head; 
spin_lock_irqsave(&priv->lock, flags); @@ -323,6 +358,38 @@ } } +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -334,12 +401,15 @@ return -1; } - ret = _ipoib_ib_post_receives(dev); + ret = ipoib_ib_post_receives(dev); if (ret) { - ipoib_warn(priv, "_ipoib_ib_post_receives returned %d\n", ret); + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); return -1; } + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + return 0; } @@ -395,8 +465,10 @@ int i; /* Kill the existing QP and allocate a new one */ - if (priv->qp != NULL) - ipoib_qp_destroy(dev); + if (priv->qp != NULL) { + ib_destroy_qp(priv->qp); + priv->qp = NULL; + } for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { if (priv->rx_ring[i].skb) { @@ -463,14 +535,24 @@ /*..ipoib_ib_dev_cleanup -- clean up IB resources for iface */ void ipoib_ib_dev_cleanup(struct net_device *dev) { - ipoib_dbg(netdev_priv(dev), "cleaning up ib_dev\n"); + struct ipoib_dev_priv *priv = netdev_priv(dev); + ipoib_dbg(priv, "cleaning up ib_dev\n"); + ipoib_mcast_stop_thread(dev); /* Delete the broadcast address and the local address */ ipoib_mcast_dev_down(dev); ipoib_transport_dev_cleanup(dev); + + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } } /* From tduffy at sun.com Thu Nov 11 09:07:44 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:07:44 -0800 Subject: [openib-general] [Fwd: [Bug 1] New: kernel prints out error message for each ib interface] Message-ID: <1100192864.25996.5.camel@duffman> -------------- next part -------------- An embedded message was scrubbed... From: bugzilla-daemon at openib.org Subject: [Bug 1] New: kernel prints out error message for each ib interface Date: Thu, 11 Nov 2004 09:08:19 -0800 (PST) Size: 2770 URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Thu Nov 11 09:07:55 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:07:55 -0800 Subject: [openib-general] [Fwd: [Bug 2] New: ipoib does not work with ipv6] Message-ID: <1100192875.25996.7.camel@duffman> -------------- next part -------------- An embedded message was scrubbed... From: bugzilla-daemon at openib.org Subject: [Bug 2] New: ipoib does not work with ipv6 Date: Thu, 11 Nov 2004 09:18:36 -0800 (PST) Size: 3781 URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 09:09:14 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:09:14 -0800 Subject: [openib-general] [PATCH] Remove use of SPIN_LOCK_UNLOCKED Message-ID: <521xf0z51h.fsf@topspin.com> In the upstream kernel, the use of SPIN_LOCK_UNLOCKED is being phased out (look for changesets like "Lock initializer unifying"). This patch converts the MAD layer to use spin_lock_init() instead, please apply. - R. Index: core/agent.c =================================================================== --- core/agent.c (revision 1202) +++ core/agent.c (working copy) @@ -30,7 +30,7 @@ #include -static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; +spinlock_t ib_agent_port_list_lock; static LIST_HEAD(ib_agent_port_list); extern kmem_cache_t *ib_mad_cache; @@ -382,4 +382,3 @@ return 0; } - Index: core/mad.c =================================================================== --- core/mad.c (revision 1202) +++ core/mad.c (working copy) @@ -74,7 +74,7 @@ static u32 ib_mad_client_id = 0; /* Port list lock */ -static spinlock_t ib_mad_port_list_lock = SPIN_LOCK_UNLOCKED; +static spinlock_t ib_mad_port_list_lock; /* Forward declarations */ @@ -2132,6 +2132,9 @@ { int ret; + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + ib_mad_cache = kmem_cache_create("ib_mad", sizeof(struct ib_mad_private), 0, @@ -2171,4 +2174,3 @@ module_init(ib_mad_init_module); module_exit(ib_mad_cleanup_module); - Index: core/agent.h =================================================================== --- core/agent.h (revision 1202) +++ core/agent.h (working copy) @@ -26,6 +26,8 @@ #ifndef __AGENT_H_ #define __AGENT_H_ +extern spinlock_t ib_agent_port_list_lock; + extern int ib_agent_port_open(struct ib_device *device, int port_num); From halr at voltaire.com Thu Nov 11 09:23:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 12:23:50 -0500 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <52actoz6rm.fsf@topspin.com> References: <52actoz6rm.fsf@topspin.com> Message-ID: <1100193829.28921.2.camel@hpc-1> On Thu, 2004-11-11 at 11:31, Roland Dreier wrote: > This patch corrects the fact that IPoIB leaks all of its address > handles by creating a list of dead AHs and freeing an AH once all the > sends using it complete. A couple of compile warnings: drivers/infiniband/ulp/ipoib/ipoib_main.c: In function `ipoib_neigh_destructor': drivers/infiniband/ulp/ipoib/ipoib_main.c:536: warning: unused variable `priv' and drivers/infiniband/ulp/ipoib/ipoib_multicast.c: In function `ipoib_mcast_free': drivers/infiniband/ulp/ipoib/ipoib_multicast.c:67: warning: unused variable `priv' Here's a trivial patch for these. 
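The warnings come from builds with IPoIB debugging disabled: ipoib_dbg() then expands to nothing, so a local variable held only for it is never referenced. A minimal sketch of the assumed macro shape (illustrative only -- the real definition lives in ipoib.h and may differ):

    #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
    #define ipoib_dbg(priv, format, arg...) \
            printk(KERN_DEBUG "%s: " format, (priv)->dev->name, ## arg)
    #else
    #define ipoib_dbg(priv, format, arg...) \
            do { } while (0)        /* priv is never evaluated here */
    #endif

Passing netdev_priv(...) directly into the macro call, as the patch below does, avoids the warning in both configurations.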
-- Hal Index: ipoib_main.c =================================================================== --- ipoib_main.c (revision 1205) +++ ipoib_main.c (working copy) @@ -533,10 +533,10 @@ static void ipoib_neigh_destructor(struct neighbour *neigh) { - struct ipoib_dev_priv *priv = netdev_priv(neigh->dev); struct ipoib_path *path = IPOIB_PATH(neigh); - ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + ipoib_dbg(netdev_priv(neigh->dev), + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", be32_to_cpup((__be32 *) neigh->ha), IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 1205) +++ ipoib_multicast.c (working copy) @@ -64,9 +64,9 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) { struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); - ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); if (mcast->ah) From roland at topspin.com Thu Nov 11 09:21:54 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:21:54 -0800 Subject: [openib-general] IPoIB w/ IBSRM? Message-ID: <52wtwsxpvx.fsf@topspin.com> Tom/Nitin, can you guys tell me if the latest IPoIB code works with IBSRM without any workarounds? I think the multicast group joining and creating should be spec compliant now but I'd like to make sure the old problems are really gone. Thanks, Roland From roland at topspin.com Thu Nov 11 09:24:11 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:24:11 -0800 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <1100193829.28921.2.camel@hpc-1> (Hal Rosenstock's message of "Thu, 11 Nov 2004 12:23:50 -0500") References: <52actoz6rm.fsf@topspin.com> <1100193829.28921.2.camel@hpc-1> Message-ID: <52sm7gxps4.fsf@topspin.com> Thanks, applied. - R. From halr at voltaire.com Thu Nov 11 09:37:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 12:37:34 -0500 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <52actoz6rm.fsf@topspin.com> References: <52actoz6rm.fsf@topspin.com> Message-ID: <1100194653.28921.16.camel@hpc-1> On Thu, 2004-11-11 at 11:31, Roland Dreier wrote: > This patch corrects the fact that IPoIB leaks all of its address > handles by creating a list of dead AHs and freeing an AH once all the > sends using it complete. Unfortunately I still see: ib0: ib_dealloc_pd failed when I removed ib_ipoib and then ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, ecb9b000 busy when I removed ib_mthca. Should the latter go away with this latest change ? -- Hal From roland at topspin.com Thu Nov 11 09:36:31 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:36:31 -0800 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <1100194653.28921.16.camel@hpc-1> (Hal Rosenstock's message of "Thu, 11 Nov 2004 12:37:34 -0500") References: <52actoz6rm.fsf@topspin.com> <1100194653.28921.16.camel@hpc-1> Message-ID: <52oei4xp7k.fsf@topspin.com> Hal> Unfortunately I still see: Hal> ib0: ib_dealloc_pd failed Hal> when I removed ib_ipoib I understand why that happens: I try to free the PD before waiting for all the AHs to be reaped. This should be fixed soon. - R. 
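Pulling the pieces of this thread together, a sketch of the AH lifecycle the patch introduces -- names follow the patch, but this is assembled here for reading, not copied from it:

    static void example_unicast_send(struct net_device *dev,
                                     struct sk_buff *skb,
                                     struct ib_ah_attr *av, u32 qpn)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);
            struct ipoib_ah *ah;

            ah = ipoib_create_ah(dev, priv->pd, av);   /* kref starts at 1 */
            if (!ah)
                    return;

            /* a cached user (e.g. a path entry) would kref_get(&ah->ref) */
            ipoib_send(dev, skb, ah, qpn);  /* stamps ah->last_send = tx_head */

            ipoib_put_ah(ah);       /* last put -> ipoib_free_ah(), which only
                                     * queues the AH on priv->dead_ahs */
    }

    /* ipoib_reap_ah() then runs every HZ and calls ib_destroy_ah() only
     * once ah->last_send <= priv->tx_tail, i.e. after every send that
     * referenced the AH has completed. */

This is why the PD-vs-AH ordering above matters: ib_dealloc_pd() can only succeed after the reaper has drained dead_ahs, which ipoib_ib_dev_cleanup() now waits for.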
From halr at voltaire.com Thu Nov 11 09:44:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 12:44:23 -0500 Subject: [openib-general] [PATCH] Remove use of SPIN_LOCK_UNLOCKED In-Reply-To: <521xf0z51h.fsf@topspin.com> References: <521xf0z51h.fsf@topspin.com> Message-ID: <1100195062.28921.24.camel@hpc-1> On Thu, 2004-11-11 at 12:09, Roland Dreier wrote: > In the upstream kernel, the use of SPIN_LOCK_UNLOCKED is being > phased out (look for changesets like "Lock initializer unifying"). > This patch converts the MAD layer to use spin_lock_init() instead, > please apply. Thanks. Applied. -- Hal From tduffy at sun.com Thu Nov 11 09:44:34 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:44:34 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52wtwsxpvx.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> Message-ID: <1100195074.25996.14.camel@duffman> On Thu, 2004-11-11 at 09:21 -0800, Roland Dreier wrote: > Tom/Nitin, can you guys tell me if the latest IPoIB code works with > IBSRM without any workarounds? I think the multicast group joining > and creating should be spec compliant now but I'd like to make sure Yes, this is working. (awesome!) As long as I only try to bring up the ib0.8001 interface. If I bring up ib0, ib_ipoib freaks out and continuously prints (very rapidly): ib0: multicast join failed for ff12:401b:7fff:0:0:0:ffff:ffff, status -22 I think this is an issue with IBSRM because pkey 7fff does not exist. I don't know whose fault this is. IBSRM continuously prints: mcast: smc_mcast_process_add_request: Could not add member: status 0x600 mcast: smc_mcast_check_new_group: Required components not set in comp_mask: required 0x00000000000130c6, set 0x0000000000010083 - mgid ff12401b7fff0000:00000000ffffffff mcast: smc_mcast_add_member: Could not verify attributes for new group: status 0x600 Bringing down ib0 stops the barrage. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 09:47:11 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:47:11 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <1100195074.25996.14.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 09:44:34 -0800") References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> Message-ID: <52k6ssxops.fsf@topspin.com> Tom> As long as I only try to bring up the ib0.8001 interface. If Tom> I bring up ib0, ib_ipoib freaks out and continuously prints Tom> (very rapidly): Tom> ib0: multicast join failed for ff12:401b:7fff:0:0:0:ffff:ffff, status -22 Hmm, looks like the backoff code isn't working properly (this should only happen every 16 seconds or so). I'll try to figure out what's going on here. Thanks for testing. - R. From roland at topspin.com Thu Nov 11 09:52:17 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:52:17 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100181660.14334.548.camel@trinity> (Matt Leininger's message of "Thu, 11 Nov 2004 06:01:00 -0800") References: <1100181660.14334.548.camel@trinity> Message-ID: <52fz3gxoha.fsf@topspin.com> Matt> The FAQ and a few other items are still a work in progress. A couple of suggestions for the FAQ: in "How do I submit source code patches?" 
I suggest adding something like "Please make sure that patches are licensed under the same terms as the original code (dual GPL/BSD for most of the OpenIB stack)." in "What version of the Linux kernel do you support?" I suggest changing the answer to something like OpenIB supports the latest 2.6 kernel (currently 2.6.9). in "What are all these upper layer protocols like IPoIB, DAPL, MPI, SDP, SRP, and others?" add a link to the IETF ipoib WG at - R. From tduffy at sun.com Thu Nov 11 09:55:27 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:55:27 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52fz3gxoha.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> Message-ID: <1100195727.25996.20.camel@duffman> On Thu, 2004-11-11 at 09:52 -0800, Roland Dreier wrote: > in "What are all these upper layer protocols like IPoIB, DAPL, MPI, SDP, > SRP, and others?" > > add a link to the IETF ipoib WG at Maybe also worth mentioning that only IPoIB is supported at this time. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Nitin.Hande at Sun.COM Thu Nov 11 09:58:45 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 09:58:45 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface Message-ID: <4193A855.5030102@Sun.COM> signed off by: Nitin Hande I would appreciate if someone can review my patch to enable inet6 address on ib interface. This is the first cut, will like to hear from all. I plan to setup a bugzilla account and append this patch to the bug that Tom has created for inet6. diff -Nurp -X dontdiff /build1/nitin/linux/linux-2.6.9/net/ipv6/addrconf.c linux-2.6.9/net/ipv6/addrconf.c --- /build1/nitin/linux/linux-2.6.9/net/ipv6/addrconf.c 2004-11-10 14:43:53.568970000 -0800 +++ linux-2.6.9/net/ipv6/addrconf.c 2004-11-10 15:07:40.196227944 -0800 @@ -1110,6 +1110,13 @@ static int ipv6_generate_eui64(u8 *eui, memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ + if (dev->addr_len != 20) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] ^= 2; + return 0; } return -1; } @@ -1809,6 +1816,7 @@ static void addrconf_dev_config(struct n if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; -------------------------------------------- Usage and output: Playing with link local address: ================================ sins-stinger-8:~/ipoibcfg/src # ifconfig ib0.8001 inet6 up sins-stinger-8:~/ipoibcfg/src # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-02-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.100.107 Bcast:192.168.100.255 Mask:255.255.255.0 inet6 addr: fe80::202:c901:976:1f81/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:31 errors:0 dropped:0 overruns:0 frame:0 TX packets:40 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:2832 (2.7 Kb) TX bytes:3532 (3.4 Kb) sins-stinger-8:~/ipoibcfg/src # ip -6 addr show 1: lo: mtu 16436 inet6 ::1/128 scope host valid_lft forever preferred_lft forever 7: ib0.8001: mtu 2044 qlen 128 inet6 fe80::202:c901:976:1f81/64 scope link valid_lft forever preferred_lft forever sins-stinger-8:~/ipoibcfg/src # route -A inet6 Kernel IPv6 routing table Destination Next Hop Flags Metric Ref Use Iface ::1/128 :: U 0 33 2 lo fe80::202:c901:976:1f81/128 :: U 0 9 2 lo fe80::202:c901:976:5161/128 fe80::202:c901:976:5161 UC 0 2 0 ib0.8001 fe80::/64 :: U 256 0 0 ib0.8001 ff00::/8 :: U 256 0 0 ib0.8001 sins-stinger-8:~/ipoibcfg/src # ping6 -I ib0.8001 fe80::202:c901:976:5161 PING fe80::202:c901:976:5161(fe80::202:c901:976:5161) from fe80::202:c901:976:1f81 ib0.8001: 56 data bytes 64 bytes from fe80::202:c901:976:5161: icmp_seq=1 ttl=64 time=2.77 ms 64 bytes from fe80::202:c901:976:5161: icmp_seq=2 ttl=64 time=0.067 ms 64 bytes from fe80::202:c901:976:5161: icmp_seq=3 ttl=64 time=0.066 ms ------------------------------------------------------------ global address and ssh test ================================ sins-stinger-8:~ # ifconfig ib0.8001 inet6 add 2222::2/64 sins-stinger-8:~ # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::2/64 Scope:Global inet6 addr: fe80::202:c901:976:1f81/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:549 errors:0 dropped:0 overruns:0 frame:0 TX packets:174 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:41510 (40.5 Kb) TX bytes:25115 (24.5 Kb) sins-stinger-8:~/ipoibcfg/src # ssh 2222::1 The authenticity of host '2222::1 (2222::1)' can't be established. RSA key fingerprint is c5:47:5d:44:85:09:a9:b5:38:d7:48:78:f0:77:30:eb. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added '2222::1' (RSA) to the list of known hosts. 
Password: Last login: Thu Nov 11 09:36:10 2004 from sr1-umpk-04.sfbay.sun.com sins-stinger-04:~ # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::1/64 Scope:Global inet6 addr: fe80::202:c901:976:5161/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:703 errors:0 dropped:0 overruns:0 frame:0 TX packets:652 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:72617 (70.9 Kb) TX bytes:66817 (65.2 Kb) ----------------------------------------------- Interoperability between Solaris and Linux: ============================================== sins-stinger-04:~/ipoibcfg/src # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::1/64 Scope:Global inet6 addr: fe80::202:c901:976:5161/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:726 errors:0 dropped:0 overruns:0 frame:0 TX packets:668 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:74673 (72.9 Kb) TX bytes:69193 (67.5 Kb) sins-stinger-04:~/ipoibcfg/src # uname -a Linux sins-stinger-04 2.6.9 #4 SMP Tue Nov 9 20:25:28 PST 2004 x86_64 x86_64 x86_64 GNU/Linux sins-stinger-04:~/ipoibcfg/src # ping6 -I ib0.8001 fe80::202:c901:976:5b01 PING fe80::202:c901:976:5b01(fe80::202:c901:976:5b01) from fe80::202:c901:976:5161 ib0.8001: 56 data bytes 64 bytes from fe80::202:c901:976:5b01: icmp_seq=1 ttl=255 time=0.401 ms 64 bytes from fe80::202:c901:976:5b01: icmp_seq=2 ttl=255 time=0.228 ms 64 bytes from fe80::202:c901:976:5b01: icmp_seq=3 ttl=255 time=0.237 ms --- fe80::202:c901:976:5b01 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.228/0.288/0.401/0.081 ms root at caseate# ifconfig ibd1 inet6 ibd1: flags=2000841 mtu 2044 index 4 inet6 fe80::202:c901:976:5b01/10 root at caseate# root at caseate# uname -a SunOS caseate.SFBay.Sun.COM 5.10 s10_70 sun4u sparc SUNW,Sun-Fire-280R root at caseate# ping fe80::202:c901:976:5161 fe80::202:c901:976:5161 is alive root at caseate# IThanks Nitin From roland at topspin.com Thu Nov 11 10:11:43 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 10:11:43 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <4193A855.5030102@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 09:58:45 -0800") References: <4193A855.5030102@Sun.COM> Message-ID: <52bre4xnkw.fsf@topspin.com> Nitin> I would appreciate if someone can review my patch to enable Nitin> inet6 address on ib interface. This is the first cut, will Nitin> like to hear from all. I plan to setup a bugzilla account Nitin> and append this patch to the bug that Tom has created for Nitin> inet6. This looks right to me. My only questions are: + eui[0] ^= 2; I remember some discussion about whether IBTA GUIDs are already modified EUI-64 or not. Is this the correct transformation or should we be doing something like "eui[0] |= 2;" (ie assume the universal bit should always be set in our IPv6 address)? What does S10 do here? Do we need to add an ipv6_ib_mc_map() function and call it in ndisc.c? Also, does the IPoIB driver need any modification to use IPv6 multicast groups correctly? Obviously IPv6 is working for you -- are ND packets being sent to the IPv4 broadcast group? If it's OK with you, I'll check in this patch as linux-2.6.9-ipoib-ipv6.diff. 
Thanks, Roland From roland at topspin.com Thu Nov 11 10:15:17 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 10:15:17 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <4193A855.5030102@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 09:58:45 -0800") References: <4193A855.5030102@Sun.COM> Message-ID: <527josxney.fsf@topspin.com> signed off by: Nitin Hande By the way, the proper format for signed off by: Nitin Hande is really Signed-off-by: Nitin Hande (see Documentation/SubmittingPatches). - R. From halr at voltaire.com Thu Nov 11 10:14:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 13:14:37 -0500 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52fz3gxoha.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> Message-ID: <1100196877.3283.120.camel@localhost.localdomain> On Thu, 2004-11-11 at 12:52, Roland Dreier wrote: > in "What version of the Linux kernel do you support?" > > I suggest changing the answer to something like OpenIB > supports the latest 2.6 kernel (currently 2.6.9). Not indicating the current version (2.6.9) makes for less frequent web page updates. Is just saying latest 2.6 kernel sufficient ? -- Hal From halr at voltaire.com Thu Nov 11 10:27:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 13:27:22 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4xnkw.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> Message-ID: <1100197642.3283.131.camel@localhost.localdomain> On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > My only questions are: > > + eui[0] ^= 2; > > I remember some discussion about whether IBTA GUIDs are already > modified EUI-64 or not. Is this the correct transformation or should > we be doing something like "eui[0] |= 2;" (ie assume the universal bit > should always be set in our IPv6 address)? IBTA GUIDs are EUI-64. The only issue I recall was whether the polarity of the U/G bit was consistent with IEEE. This was updated at IBA 1.2. It now says "manufacturer assigns EUI-64 with global scope set. May also assign additional EUI-64 with local scope." > What does S10 do here? What's S10 ? > Do we need to add an ipv6_ib_mc_map() function and call it in ndisc.c? This is needed. IPv6 multicast mapping is slightly different from the IPv4 mapping. -- Hal From tduffy at sun.com Thu Nov 11 10:35:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 10:35:23 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100197642.3283.131.camel@localhost.localdomain> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> <1100197642.3283.131.camel@localhost.localdomain> Message-ID: <1100198123.25996.33.camel@duffman> On Thu, 2004-11-11 at 13:27 -0500, Hal Rosenstock wrote: > What's S10 ? Solaris 10. Which has IPv6oIB. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Thu Nov 11 10:39:58 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 10:39:58 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100196877.3283.120.camel@localhost.localdomain> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> Message-ID: <1100198398.25996.35.camel@duffman> On Thu, 2004-11-11 at 13:14 -0500, Hal Rosenstock wrote: > Not indicating the current version (2.6.9) makes for less frequent web > page updates. Is just saying latest 2.6 kernel sufficient ? How about making the FAQ a WIKI :-) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 10:46:21 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 10:46:21 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100197642.3283.131.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 13:27:22 -0500") References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> <1100197642.3283.131.camel@localhost.localdomain> Message-ID: <52sm7gw7eq.fsf@topspin.com> Hal> IBTA GUIDs are EUI-64. The only issue I recall was whether Hal> the polarity of the U/G bit was consistent with IEEE. This Hal> was updated at IBA 1.2. It now says "manufacturer assigns Hal> EUI-64 with global scope set. May also assign additional Hal> EUI-64 with local scope." Uh-oh -- none of the HCAs I have access to have the universal bit set in their port GUIDs. - R. From iod00d at hp.com Thu Nov 11 10:47:15 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 11 Nov 2004 10:47:15 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100196877.3283.120.camel@localhost.localdomain> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> Message-ID: <20041111184715.GE32218@cup.hp.com> On Thu, Nov 11, 2004 at 01:14:37PM -0500, Hal Rosenstock wrote: > Not indicating the current version (2.6.9) makes for less frequent web > page updates. Is just saying latest 2.6 kernel sufficient ? Probably not since SLES9-ia64 is based on 2.6.5 and it won't work as-is. Making ithe FAQ a wiki (tduffy) is a good idea. grant From halr at voltaire.com Thu Nov 11 10:50:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 13:50:28 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52sm7gw7eq.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> <1100197642.3283.131.camel@localhost.localdomain> <52sm7gw7eq.fsf@topspin.com> Message-ID: <1100199028.3283.157.camel@localhost.localdomain> On Thu, 2004-11-11 at 13:46, Roland Dreier wrote: > Hal> IBTA GUIDs are EUI-64. The only issue I recall was whether > Hal> the polarity of the U/G bit was consistent with IEEE. This > Hal> was updated at IBA 1.2. It now says "manufacturer assigns > Hal> EUI-64 with global scope set. May also assign additional > Hal> EUI-64 with local scope." > > Uh-oh -- none of the HCAs I have access to have the universal bit set > in their port GUIDs. That's the old way (where old < IBA 1.2). I can dig out more emails on this and any recommendations. 
In the older versions of IBA, the bit was inverted due to some language ambiguity. It was supposed to be global. I would think we want to be compliant with the IBA 1.2 definition but if there are practical matters with this... -- Hal From Nitin.Hande at Sun.COM Thu Nov 11 11:02:36 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 11:02:36 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4xnkw.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> Message-ID: <4193B74C.2060408@Sun.COM> All, Thanks for your comments, Roland Dreier wrote: > Nitin> I would appreciate if someone can review my patch to enable > Nitin> inet6 address on ib interface. This is the first cut, will > Nitin> like to hear from all. I plan to setup a bugzilla account > Nitin> and append this patch to the bug that Tom has created for > Nitin> inet6. > > This looks right to me. My only questions are: > > + eui[0] ^= 2; > > I remember some discussion about whether IBTA GUIDs are already > modified EUI-64 or not. Is this the correct transformation or should > we be doing something like "eui[0] |= 2;" (ie assume the universal bit > should always be set in our IPv6 address)? What does S10 do here? Yes, I see S10 setting the bit as eur[0] |= 2. I will update that in my patch. > > Do we need to add an ipv6_ib_mc_map() function and call it in ndisc.c? yes, I will code that function and send a new patch including the comments received so far... > > Also, does the IPoIB driver need any modification to use IPv6 > multicast groups correctly? > > Obviously IPv6 is working for you -- are ND packets being sent to the > IPv4 broadcast group? Yes. > > If it's OK with you, I'll check in this patch as linux-2.6.9-ipoib-ipv6.diff. Let me update the patch and if it looks okay, you can then go ahead. Hope that is fine.... Thanks Nitin > > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Thu Nov 11 11:31:53 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 11:31:53 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <4193A855.5030102@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 09:58:45 -0800") References: <4193A855.5030102@Sun.COM> Message-ID: <52oei4w5au.fsf@topspin.com> I just tested, and the IPv6 ND packets are being sent to the MGID ff12:401b:ffff:0:0:0:ffff:ffff. This makes sense because net/ipv6/ndisc.c uses dev->broadcast in ndisc_mc_map() if it doesn't know about the interface type. I'll see if creating ipv6_ib_mc_map() helps. - R. From roland at topspin.com Thu Nov 11 12:13:42 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 12:13:42 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52oei4w5au.fsf@topspin.com> (Roland Dreier's message of "Thu, 11 Nov 2004 11:31:53 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> Message-ID: <52bre4w3d5.fsf@topspin.com> OK, with the patch below all the correct IPv6 groups seem to be created and used. Ping works at least... One question about IPv6 and IPoIB: currently the IPoIB driver joins the IPv4 broadcast group and then uses those parameters to join or create (as needed) the other groups, including all IPv6 multicast groups. 
Is this correct, or is there a distinguished IPv6 MCG that is supposed to be used as a base as the IPv4 broadcast group is? Thanks, Roland Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-2.6.9/include/net/if_inet6.h =================================================================== --- linux-2.6.9.orig/include/net/if_inet6.h 2004-10-18 14:55:28.000000000 -0700 +++ linux-2.6.9/include/net/if_inet6.h 2004-11-11 11:38:20.000000000 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-2.6.9/net/ipv6/addrconf.c =================================================================== --- linux-2.6.9.orig/net/ipv6/addrconf.c 2004-10-18 14:55:24.000000000 -0700 +++ linux-2.6.9/net/ipv6/addrconf.c 2004-11-11 11:35:23.000000000 -0800 @@ -1110,6 +1110,13 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ + if (dev->addr_len != 20) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1809,6 +1816,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. */ return; Index: linux-2.6.9/net/ipv6/ndisc.c =================================================================== --- linux-2.6.9.orig/net/ipv6/ndisc.c 2004-10-18 14:54:32.000000000 -0700 +++ linux-2.6.9/net/ipv6/ndisc.c 2004-11-11 11:35:50.000000000 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Thu Nov 11 13:02:55 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 13:02:55 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <1100195074.25996.14.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 09:44:34 -0800") References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> Message-ID: <52y8h8umio.fsf@topspin.com> Tom> As long as I only try to bring up the ib0.8001 interface. If Tom> I bring up ib0, ib_ipoib freaks out and continuously prints Tom> (very rapidly): OK, I think I fixed this. When you get a chance to retest, try bringing up ib0 as see if it still acts freaky. Thanks, Roland From halr at voltaire.com Thu Nov 11 13:01:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 16:01:23 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4w3d5.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> Message-ID: <1100206883.3283.211.camel@localhost.localdomain> On Thu, 2004-11-11 at 15:13, Roland Dreier wrote: > One question about IPv6 and IPoIB: currently the IPoIB driver joins > the IPv4 broadcast group and then uses those parameters to join or > create (as needed) the other groups, including all IPv6 multicast > groups. 
Is this correct, or is there a distinguished IPv6 MCG that is > supposed to be used as a base as the IPv4 broadcast group is? There are no broadcast addresses in IPv6, their function being superseded by multicast addresses. Neighbor Solicitation messages are multicast to the solicited-node multicast address of the target address. So I don't think there is a "master" IPv6 group. The IPoIB I-D does say that all group parameters should come from the broadcast group. But that's an IPv4 group so I'm not sure about IPv6 as a node could have an IPv6 interface but no IPv4 interface. -- Hal From roland at topspin.com Thu Nov 11 13:09:17 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 13:09:17 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4w3d5.fsf@topspin.com> (Roland Dreier's message of "Thu, 11 Nov 2004 12:13:42 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> Message-ID: <52u0rwum82.fsf@topspin.com> By the way, can anyone explain the following to me (an IPv6 rookie): # ping6 -I ib0 fe80::202:c901:78c:e461 PING fe80::202:c901:78c:e461(fe80::202:c901:78c:e461) from fe80::202:c901:7fc:c711 ib0: 56 data bytes 64 bytes from fe80::202:c901:78c:e461: icmp_seq=1 ttl=64 time=32.2 ms 64 bytes from fe80::202:c901:78c:e461: icmp_seq=2 ttl=64 time=14.7 ms 64 bytes from fe80::202:c901:78c:e461: icmp_seq=3 ttl=64 time=14.6 ms --- fe80::202:c901:78c:e461 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2001ms rtt min/avg/max/mdev = 14.682/20.557/32.274/8.286 ms # ping6 fe80::202:c901:78c:e461 connect: Invalid argument # ssh -6 fe80::202:c901:78c:e461 ssh: connect to host fe80::202:c901:78c:e461 port 22: Invalid argument ssh works fine if I assign non-autoconfig'ed addresses. Ethernet behaves the same way so I don't think it's something to do with the IPoIB driver, but I would like to understand it better (if only for my own edification). Thanks, Roland From Nitin.Hande at Sun.COM Thu Nov 11 14:03:13 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 14:03:13 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52u0rwum82.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> Message-ID: <4193E1A1.1060202@Sun.COM> Roland Dreier wrote: > By the way, can anyone explain the following to me (an IPv6 rookie): > > # ping6 -I ib0 fe80::202:c901:78c:e461 > PING fe80::202:c901:78c:e461(fe80::202:c901:78c:e461) from fe80::202:c901:7fc:c711 ib0: 56 data bytes > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=1 ttl=64 time=32.2 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=2 ttl=64 time=14.7 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=3 ttl=64 time=14.6 ms > > --- fe80::202:c901:78c:e461 ping statistics --- > 3 packets transmitted, 3 received, 0% packet loss, time 2001ms > rtt min/avg/max/mdev = 14.682/20.557/32.274/8.286 ms > > # ping6 fe80::202:c901:78c:e461 > connect: Invalid argument In order to ping link local address you need to specify an outgoing interface. Thats mentioned in man ping. - I interface address Set source address to specified interface address. Argument may be numeric IP address or name of device.When pinging IPv6 link-local address this option is required. 
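At the sockets level, the failure Roland saw is the kernel rejecting a link-local destination that carries no scope. A sketch (standard RFC 3493 API, assumed here -- not code from ping6 itself) of the field that -I fills in:

    #include <arpa/inet.h>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <string.h>

    static void fill_linklocal_dest(struct sockaddr_in6 *dst)
    {
            memset(dst, 0, sizeof *dst);
            dst->sin6_family = AF_INET6;
            inet_pton(AF_INET6, "fe80::202:c901:78c:e461", &dst->sin6_addr);
            /* fe80::/10 is ambiguous across links; with sin6_scope_id left
             * at 0, connect() fails with EINVAL ("Invalid argument"). */
            dst->sin6_scope_id = if_nametoindex("ib0");
    }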
> > # ssh -6 fe80::202:c901:78c:e461 > ssh: connect to host fe80::202:c901:78c:e461 port 22: Invalid argument > > ssh works fine if I assign non-autoconfig'ed addresses. > > Ethernet behaves the same way so I don't think it's something to do > with the IPoIB driver, but I would like to understand it better (if > only for my own edification). On Solaris I see ssh just working fine on auto-config'ed address. Need more time to understand linux code. Thanks Nitin > > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Tom.Duffy at Sun.COM Thu Nov 11 14:12:24 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 11 Nov 2004 14:12:24 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52y8h8umio.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> Message-ID: <1100211144.25996.55.camel@duffman> On Thu, 2004-11-11 at 13:02 -0800, Roland Dreier wrote: > Tom> As long as I only try to bring up the ib0.8001 interface. If > Tom> I bring up ib0, ib_ipoib freaks out and continuously prints > Tom> (very rapidly): > > OK, I think I fixed this. When you get a chance to retest, try > bringing up ib0 as see if it still acts freaky. Yes, this is fixed now. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Tom.Duffy at Sun.COM Thu Nov 11 14:14:50 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 11 Nov 2004 14:14:50 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52y8h8umio.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> Message-ID: <1100211290.25996.58.camel@duffman> On Thu, 2004-11-11 at 13:02 -0800, Roland Dreier wrote: > OK, I think I fixed this. When you get a chance to retest, try > bringing up ib0 as see if it still acts freaky. Oops. Spoke too soon. It seems `ifconfig ib0 down` now hangs. Can't [ctrl]-c it either. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Nov 11 14:21:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 17:21:59 -0500 Subject: [Fwd: Re: [openib-general] [PATCH] Enable inet6 on ib interface] Message-ID: <1100211719.3283.238.camel@localhost.localdomain> Here's some text from the IPoIB I-D relative to this: [AARCH] requires the interface identifier be created in the "Modified EUI-64" format when derived from an EUI-64 identifier. [IBTA] is unclear if the GUID should use IEEE EUI-64 format or the "Modified EUI-64" format. Therefore, when creating an interface identifier from the GUID an implementation MUST do the following: => Determine if the GUID is a modified EUI-64 identifier ("u" bit is toggled) as defined by [AARCH] => If the GUID is a modified EUI-64 identifier then the "u" bit MUST NOT be toggled when creating the interface identifier => If the GUID is an umodified EUI-64 identifier then the "u" bit MUST be toggled in compliance with [AARCH] I'm not sure how one determines whether the GUID is modified or unmodified EUI-64. 
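The two transformations being debated, side by side as a sketch; the toggle rule is from RFC 2373 / [AARCH], and which variant applies to IB GUIDs is exactly the open question in this thread:

    #include <linux/string.h>
    #include <linux/types.h>

    /* RFC 2373: for a true (unmodified) EUI-64, build the IPv6 interface
     * identifier by inverting the universal/local bit, 0x02 of byte 0. */
    static void eui64_to_ifid(const u8 *guid, u8 *ifid)
    {
            memcpy(ifid, guid, 8);
            ifid[0] ^= 2;   /* Nitin's first patch: toggle u/L */
    }

    /* The alternative adopted later in the thread (and what Solaris 10
     * does) is to force the bit set -- 'eui[0] |= 2' -- since pre-1.2
     * HCAs ship GUIDs with the universal bit clear. */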
Here's an email from the LWG chair to the IPoIB WG back on August 9: [Ipoverib] Update on status of eui-64 in IB ________________________________________________________________________ * To: ipoverib at ietf.org * Subject: [Ipoverib] Update on status of eui-64 in IB * From: Daniel Cassiday * Date: Mon, 09 Aug 2004 17:28:42 -0400 * List-help: * List-id: IP over InfiniBand WG Discussion List * List-post: * List-subscribe: , * List-unsubscribe: , * Reply-to: Daniel.Cassiday at Sun.COM * Sender: ipoverib-bounces at ietf.org * User-agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0 ________________________________________________________________________ A while back it was pointed out that the IB specification was unclear on how to set the universal/local bit in the EUI-64. This was causing a problem in the ipoverib wg on how to generate an interface identifier from this EUI-64. The IBTA has looked into this and planning is to modify the IB spec to clarify that the universal/local bit should be cleared when defining the EUI-64. The spec with this modification is currently under internal review. Pending approval (which is expected) the clarification will be included in the upcoming 1.2 release of the spec. This means that the IBA will conform to the IEEE definition of universal/local bit, and that for ipoverib, interface identifiers should be generated from the EUI-64 as per RFC 2373 (i.e. the universal/local bit should be inverted). (Note, at one point the IBTA Link WG considered using a special value in the OUI field (i.e. this is where the vendor id appears) to indicate local scope but this was discarded in favor of the simplier fix defined above.) _______________________________________________ IPoverIB mailing list IPoverIB at ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib -----Forwarded Message----- From: Hal Rosenstock To: Roland Dreier Cc: Nitin Hande , openib-general at openib.org Subject: Re: [openib-general] [PATCH] Enable inet6 on ib interface Date: 11 Nov 2004 13:50:28 -0500 On Thu, 2004-11-11 at 13:46, Roland Dreier wrote: > Hal> IBTA GUIDs are EUI-64. The only issue I recall was whether > Hal> the polarity of the U/G bit was consistent with IEEE. This > Hal> was updated at IBA 1.2. It now says "manufacturer assigns > Hal> EUI-64 with global scope set. May also assign additional > Hal> EUI-64 with local scope." > > Uh-oh -- none of the HCAs I have access to have the universal bit set > in their port GUIDs. That's the old way (where old < IBA 1.2). I can dig out more emails on this and any recommendations. In the older versions of IBA, the bit was inverted due to some language ambiguity. It was supposed to be global. I would think we want to be compliant with the IBA 1.2 definition but if there are practical matters with this... -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Nov 11 14:42:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 14:42:39 -0800 Subject: [openib-general] QP error handling Message-ID: <4193EADF.50001@ichips.intel.com> I'm trying to force errors on QP0/1 to see if my changes can recover from them. I force the errors by sending with an invalid lkey. Based on the implementation of mthca, what can be expected? I'm not seeing the QP event handler get invoked. 
I do receive a completion error, followed by flushed work requests. Attempts to modify the QP directly to RTS fail -- I was hoping that the QP would enter SQE state. - Sean From roland at topspin.com Thu Nov 11 14:46:45 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 14:46:45 -0800 Subject: [openib-general] Re: QP error handling In-Reply-To: <4193EADF.50001@ichips.intel.com> (Sean Hefty's message of "Thu, 11 Nov 2004 14:42:39 -0800") References: <4193EADF.50001@ichips.intel.com> Message-ID: <52ekj0uhpm.fsf@topspin.com> Sean> I'm trying to force errors on QP0/1 to see if my changes can Sean> recover from them. I force the errors by sending with an Sean> invalid lkey. Based on the implementation of mthca, what Sean> can be expected? Sean> I'm not seeing the QP event handler get invoked. I do Sean> receive a completion error, followed by flushed work Sean> requests. Attempts to modify the QP directly to RTS fail -- Sean> I was hoping that the QP would enter SQE state. mthca currently doesn't handle these 'asynchronous' state transitions (ie transition to error). It continues to think the QP is in the RTS state. Proper handling needs to be implemented. However should there be a QP event for a send with invalid L_Key? I would have thought the failed completion entry would be all the consumer gets. - R. From mshefty at ichips.intel.com Thu Nov 11 14:58:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 14:58:05 -0800 Subject: [openib-general] Re: QP error handling In-Reply-To: <52ekj0uhpm.fsf@topspin.com> References: <4193EADF.50001@ichips.intel.com> <52ekj0uhpm.fsf@topspin.com> Message-ID: <4193EE7D.6030800@ichips.intel.com> Roland Dreier wrote: > mthca currently doesn't handle these 'asynchronous' state transitions > (ie transition to error). It continues to think the QP is in the RTS > state. Proper handling needs to be implemented. Ok - thanks for the info. > However should there be a QP event for a send with invalid L_Key? I > would have thought the failed completion entry would be all the > consumer gets. I don't think an async event is necessary. I was working off the failed completion entry, but when the modify_qp call failed, I was trying to determine if the QP was going into the error state (which would disallow the transition) by checking for a callback to the async event handler. - Sean From roland at topspin.com Thu Nov 11 15:01:15 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 15:01:15 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <1100211290.25996.58.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 14:14:50 -0800") References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> <1100211290.25996.58.camel@duffman> Message-ID: <52d5ykuh1g.fsf@topspin.com> Tom> Oops. Spoke too soon. It seems `ifconfig ib0 down` now hangs. I think this should fix it (already checked in). - R. 
Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1213) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -379,6 +379,8 @@ if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + mcast->query = NULL; + down(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { if (status == -ETIMEDOUT) From Tom.Duffy at Sun.COM Thu Nov 11 15:31:10 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 11 Nov 2004 15:31:10 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52d5ykuh1g.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> <1100211290.25996.58.camel@duffman> <52d5ykuh1g.fsf@topspin.com> Message-ID: <1100215870.25996.64.camel@duffman> On Thu, 2004-11-11 at 15:01 -0800, Roland Dreier wrote: > Tom> Oops. Spoke too soon. It seems `ifconfig ib0 down` now hangs. > > I think this should fix it (already checked in). Yuppers. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Nov 11 15:56:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 18:56:34 -0500 Subject: [openib-general] Link Width Active Message-ID: <1100217394.3369.2.camel@localhost.localdomain> Hi, Is there a way to display PortInfo components other than PortState ? For example, LinkWidthActive might be useful (as might some others). I couldn't find it in /sys/class/infiniband/mthca0/port/1. Thanks. -- Hal From tduffy at sun.com Thu Nov 11 16:07:04 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 16:07:04 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4w3d5.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> Message-ID: <1100218024.25996.73.camel@duffman> On Thu, 2004-11-11 at 12:13 -0800, Roland Dreier wrote: > OK, with the patch below all the correct IPv6 groups seem to be > created and used. Ping works at least... With the updated patch (and with Nitin's original patch), when I bring up ipv6, I am not getting the correct link local address. I can assign it a global address and ping just fine, but the lower 64 bits of the IPv6 address are NULL (except for the set link local 2 (mentioned earlier)): ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::2/64 Scope:Global inet6 addr: fe80::200:0:0:0/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:15 errors:0 dropped:0 overruns:0 frame:0 TX packets:18 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:1512 (1.4 Kb) TX bytes:1752 (1.7 Kb) Any ideas? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Nitin.Hande at Sun.COM Thu Nov 11 16:16:20 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 16:16:20 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52u0rwum82.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> Message-ID: <419400D4.2060900@Sun.COM> Roland Dreier wrote: > By the way, can anyone explain the following to me (an IPv6 rookie): > > # ping6 -I ib0 fe80::202:c901:78c:e461 > PING fe80::202:c901:78c:e461(fe80::202:c901:78c:e461) from fe80::202:c901:7fc:c711 ib0: 56 data bytes > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=1 ttl=64 time=32.2 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=2 ttl=64 time=14.7 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=3 ttl=64 time=14.6 ms > > --- fe80::202:c901:78c:e461 ping statistics --- > 3 packets transmitted, 3 received, 0% packet loss, time 2001ms > rtt min/avg/max/mdev = 14.682/20.557/32.274/8.286 ms > > # ping6 fe80::202:c901:78c:e461 > connect: Invalid argument > > # ssh -6 fe80::202:c901:78c:e461 > ssh: connect to host fe80::202:c901:78c:e461 port 22: Invalid argument > > ssh works fine if I assign non-autoconfig'ed addresses. Allright, so looking further more now I can get ssh working, the sytax is very peculiar for linux sins-stinger-04:/etc/ssh # uname -a Linux sins-stinger-04 2.6.9 #5 SMP Thu Nov 11 12:54:00 PST 2004 x86_64 x86_64 x86_64 GNU/Linux sins-stinger-04:/etc/ssh # ssh fe80::209:3dff:fe00:4766%eth1 The authenticity of host 'fe80::209:3dff:fe00:4766%eth1 (fe80::209:3dff:fe00:4766%eth1)' can't be established. RSA key fingerprint is ce:f5:ea:82:2a:42:a2:f9:e0:01:ba:ef:63:3c:cb:2a. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'fe80::209:3dff:fe00:4766%eth1' (RSA) to the list of known hosts. Password: Last login: Thu Nov 11 17:15:01 2004 from sr1-umpk-04.sfbay.sun.com sins-stinger-8:~ # uname -a Linux sins-stinger-8 2.6.9 #9 SMP Wed Nov 10 09:42:29 PST 2004 x86_64 x86_64 x86_64 GNU/Linux sins-stinger-8:~ # Based on some googling I found that for linux, since Link Local addresses are not routable, you need to provide the scope (by specifying an outgoing interface) to ssh in linux. This is very different from Solaris implementation where it still derives the scope of link local address and thereby its outgoing interface too. Does that sounds okay ? Thanks Nitin > > Ethernet behaves the same way so I don't think it's something to do > with the IPoIB driver, but I would like to understand it better (if > only for my own edification). > > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Thu Nov 11 16:18:21 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:18:21 -0800 Subject: [openib-general] Re: Link Width Active In-Reply-To: <1100217394.3369.2.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 18:56:34 -0500") References: <1100217394.3369.2.camel@localhost.localdomain> Message-ID: <523bzfvs1e.fsf@topspin.com> Hal> Hi, Is there a way to display PortInfo components other than Hal> PortState ? 
For example, LinkWidthActive might be useful (as Hal> might some others). I couldn't find it in Hal> /sys/class/infiniband/mthca0/port/1. Sure, we just need to add more attributes in core/sysfs.c. LinkWidthActive would require processing a PortInfo MAD (PortState comes from ib_port_query()) but it's not too much work to implement. - R. From roland at topspin.com Thu Nov 11 16:21:59 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:21:59 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100218024.25996.73.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 16:07:04 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> Message-ID: <52y8h7udaw.fsf@topspin.com> Tom> With the updated patch (and with Nitin's original patch), Tom> when I bring up ipv6, I am not getting the correct link local Tom> address. I can assign it a global address and ping just Tom> fine, but the lower 64 bits of the IPv6 address are NULL Tom> (except for the set link local 2 (mentioned earlier)): Tom> Any ideas? Yup, looks like the device addr for child interfaces isn't being set correctly; compare: # ip addr show ib0 13: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:04:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:07:8c:e4:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vs. # ip addr show ib0.8001 15: ib0.8001: mtu 2044 qdisc noop qlen 128 link/[32] 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:80:01:00:00:00:00:00:00:ff:ff:ff:ff I should have a patch fairly soon. - R. From roland at topspin.com Thu Nov 11 16:38:57 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:38:57 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52y8h7udaw.fsf@topspin.com> (Roland Dreier's message of "Thu, 11 Nov 2004 16:21:59 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> Message-ID: <52u0rvucim.fsf@topspin.com> I think we just need to copy our address to the child interface. This patch seems to fix it for me (already checked in). (By the way, how does IPv6 handle autoconfig for VLAN interfaces? With this change you can get duplicate autoconfig'ed addresses, although they will be in different partitions. I'm not sure if this causes any problems...) - R. 
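The all-zero link-layer address is also where the fe80::200:0:0:0 address seen earlier comes from: for an ARPHRD_INFINIBAND interface, IPv6 autoconfiguration derives the link-local interface ID from the port GUID in the last 8 bytes of the 20-byte hardware address. A minimal sketch of that derivation (the helper name is made up and this is not the exact addrconf code):

/*
 * Sketch: derive the modified EUI-64 interface ID from an IPoIB
 * hardware address (assumes kernel types; not the exact addrconf
 * code). The 20-byte address is 4 bytes of flags/QPN followed by
 * the 16-byte port GID; the low 8 bytes of the GID are the GUID.
 */
#include <linux/types.h>
#include <linux/string.h>

static void ipoib_addr_to_ifid(const u8 *dev_addr, u8 *eui)
{
	memcpy(eui, dev_addr + 12, 8);	/* low 8 bytes of the GID */
	eui[0] ^= 2;			/* flip the universal/local bit */
}

Applied to the ib0 address above (GUID 00:02:c9:01:07:8c:e4:61) this gives fe80::202:c901:78c:e461, and applied to the all-zero ib0.8001 address it gives exactly the fe80::200:0:0:0 reported earlier.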
Index: infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- infiniband/ulp/ipoib/ipoib_vlan.c (revision 1212) +++ infiniband/ulp/ipoib/ipoib_vlan.c (working copy) @@ -74,6 +74,7 @@ priv->pkey = pkey; + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, IPOIB_HW_ADDR_LEN); priv->dev->broadcast[8] = pkey >> 8; priv->dev->broadcast[9] = pkey & 0xff; From roland at topspin.com Thu Nov 11 16:46:37 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:46:37 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <419400D4.2060900@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 16:16:20 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> <419400D4.2060900@Sun.COM> Message-ID: <52pt2juc5u.fsf@topspin.com> Nitin> Based on some googling I found that for linux, since Link Nitin> Local addresses are not routable, you need to provide the Nitin> scope (by specifying an outgoing interface) to ssh in Nitin> linux. This is very different from Solaris implementation Nitin> where it still derives the scope of link local address and Nitin> thereby its outgoing interface too. Does that sounds okay ? Thanks, that works for me. Not very intuitive but I guess it makes sense. In fact I don't see how Solaris can deduce the interface from a link local IPv6 address... - R. From tduffy at sun.com Thu Nov 11 17:40:53 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 17:40:53 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52u0rvucim.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> <52u0rvucim.fsf@topspin.com> Message-ID: <1100223653.24741.3.camel@duffman> On Thu, 2004-11-11 at 16:38 -0800, Roland Dreier wrote: > I think we just need to copy our address to the child interface. This > patch seems to fix it for me (already checked in). Yup, this fixes it. You rock. > (By the way, how does IPv6 handle autoconfig for VLAN interfaces? > With this change you can get duplicate autoconfig'ed addresses, > although they will be in different partitions. I'm not sure if this > causes any problems...) Would you really bring both interfaces up? If this is a problem, the spec should have the pkey be part of the link local address. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Thu Nov 11 17:41:22 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 17:41:22 -0800 Subject: [openib-general] [PATCH] [1/2] SQE handling on MAD QPs Message-ID: <419414C2.4090300@ichips.intel.com> This patch recovers from send queue errors on QP 0/1. (It should also "work" in the case of fatal errors, but does not try to recover.) Code was tested by forcing send errors and checking that the port could still go to active. Patch can be applied separately from patch to mthca, but requires other patch to work properly. 
- Sean Index: mad.c =================================================================== --- mad.c (revision 1209) +++ mad.c (working copy) @@ -90,6 +90,8 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port. */ @@ -591,6 +593,7 @@ /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. ud.timeout_ms); + mad_send_wr->retry = 0; /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -1339,6 +1342,70 @@ } } +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) { + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup. + */ + return; + } + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests. + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send. */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + /* Transition QP to RTS and fail offending send.
*/ + ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - unable to " + "transition QP to RTS : %d\n", ret); + ib_mad_send_done_handler(port_priv, wc); + mark_sends_for_retry(qp_info); + } +} + /* * IB MAD completion callback */ @@ -1346,34 +1413,25 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; - struct ib_mad_list_head *mad_list; - struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { - if (wc.status != IB_WC_SUCCESS) { - /* Determine if failure was a send or receive */ - mad_list = (struct ib_mad_list_head *) - (unsigned long)wc.wr_id; - qp_info = mad_list->mad_queue->qp_info; - if (mad_list->mad_queue == &qp_info->send_queue) - wc.opcode = IB_WC_SEND; - else - wc.opcode = IB_WC_RECV; - } - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - BUG_ON(1); - break; - } + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); } } @@ -1717,7 +1775,8 @@ /* * Modify QP into Ready-To-Send state */ -static inline int ib_mad_change_qp_state_to_rts(struct ib_qp *qp) +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state) { int ret; struct ib_qp_attr *attr; @@ -1729,11 +1788,12 @@ "ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RTS; - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; - + attr_mask = IB_QP_STATE; + if (cur_state == IB_QPS_RTR) { + attr->sq_psn = IB_MAD_SEND_Q_PSN; + attr_mask |= IB_QP_SQ_PSN; + } ret = ib_modify_qp(qp, attr, attr_mask); kfree(attr); @@ -1793,7 +1853,8 @@ goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, + IB_QPS_RTR); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1852,6 +1913,15 @@ } } +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + static void init_mad_queue(struct ib_mad_qp_info *qp_info, struct ib_mad_queue *mad_queue) { @@ -1884,6 +1954,8 @@ qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); if (IS_ERR(qp_info->qp)) { printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1209) +++ mad_priv.h (working copy) @@ -127,6 +127,7 @@ u64 wr_id; /* client WR ID */ u64 tid; unsigned long timeout; + int retry; int refcount; enum ib_wc_status status; }; From mshefty at ichips.intel.com Thu Nov 11 17:45:22 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 17:45:22 -0800 Subject: [openib-general] [PATCH] [2/2] change QP state to SQE Message-ID: <419415B2.3060907@ichips.intel.com> This should transition the QP state to SQE when encountering a send error on the CQ. There may be a better way of doing this; I didn't spend a lot of time studying the code. - Sean Index: mthca_dev.h =================================================================== --- mthca_dev.h (revision 1209) +++ mthca_dev.h (working copy) @@ -311,6 +311,7 @@ void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); +void mthca_qp_send_error(struct mthca_qp *qp); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); Index: mthca_cq.c =================================================================== --- mthca_cq.c (revision 1209) +++ mthca_cq.c (working copy) @@ -330,6 +330,9 @@ break; } + if (cqe->syndrome != SYNDROME_WR_FLUSH_ERR && is_send) + mthca_qp_send_error(qp); + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); if (err) return err; Index: mthca_qp.c =================================================================== --- mthca_qp.c (revision 1209) +++ mthca_qp.c (working copy) @@ -288,6 +288,12 @@ wake_up(&qp->wait); } +void mthca_qp_send_error(struct mthca_qp *qp) +{ + if (qp->state == IB_QPS_RTS) + qp->state = IB_QPS_SQE; +} + static int to_mthca_state(enum ib_qp_state ib_state) { switch (ib_state) { From roland at topspin.com Thu Nov 11 18:09:58 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 18:09:58 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100223653.24741.3.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 17:40:53 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> <52u0rvucim.fsf@topspin.com> <1100223653.24741.3.camel@duffman> Message-ID: <52lld7u8ax.fsf@topspin.com> Tom> Would you really bring both interfaces up? If this is a Tom> problem, the spec should have the pkey be part of the link Tom> local address. It actually seems to work fine to bring up multiple IPv6 interfaces that end up with the same link local address (like ib0 and ib0.8001). The fact that Linux forces you to specify an interface when using a link local address comes to the rescue. And I can't think of any issues, since different partitions are really pretty disjoint. - R. 
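For what it's worth, the interface that Linux insists on shows up in the sockets API as sin6_scope_id (the %eth1 suffix that ssh accepted is just another spelling of it). A minimal userspace sketch, reusing the link-local address and ib0 interface from this thread:

/*
 * Sketch: connect to an IPv6 link-local peer on Linux. Without an
 * explicit scope (sin6_scope_id, or a %ib0 suffix on the address),
 * connect() fails with EINVAL -- the "Invalid argument" seen
 * earlier in this thread.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 dst;
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	memset(&dst, 0, sizeof dst);
	dst.sin6_family = AF_INET6;
	dst.sin6_port = htons(22);
	inet_pton(AF_INET6, "fe80::202:c901:78c:e461", &dst.sin6_addr);
	dst.sin6_scope_id = if_nametoindex("ib0");	/* pick the interface */

	if (connect(fd, (struct sockaddr *) &dst, sizeof dst))
		perror("connect");
	close(fd);
	return 0;
}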
From roland at topspin.com Thu Nov 11 18:11:08 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 18:11:08 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <419415B2.3060907@ichips.intel.com> (Sean Hefty's message of "Thu, 11 Nov 2004 17:45:22 -0800") References: <419415B2.3060907@ichips.intel.com> Message-ID: <52hdnvu88z.fsf@topspin.com> Sean> This should transition the QP state to SQE when encountering Sean> a send error on the CQ. There may be a better way of doing Sean> this; I didn't spend a lot of time studying the code. Thanks for the patch... let me look at how I want to do this (and probably handle transitions to ERR while I'm at it). - R. From halr at voltaire.com Thu Nov 11 20:00:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 23:00:30 -0500 Subject: [openib-general] PD dealloc and AH busy problems remain Message-ID: <1100232030.3369.50.camel@localhost.localdomain> Hi, Don't know what the proper expectation is (whether the change below meant that the PD dealloc problem and the AH busy problem should be gone), r1211 | roland | 2004-11-11 15:36:46 -0500 (Thu, 11 Nov 2004) | 1 line Move final reap of AHs to a more correct location but they are not (just in case you thought they should be). The PD dealloc problem is now intermittent on IPoIB module removal (an improvement). AH busy on mthca module removal is still regular. If the message in the log didn't mean this, ignore this. -- Hal From roland at topspin.com Thu Nov 11 20:41:23 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 20:41:23 -0800 Subject: [openib-general] Re: PD dealloc and AH busy problems remain In-Reply-To: <1100232030.3369.50.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 23:00:30 -0500") References: <1100232030.3369.50.camel@localhost.localdomain> Message-ID: <52d5yju1ak.fsf@topspin.com> Hal> but they are not (just in case you thought they should be). The Hal> PD dealloc problem is now intermittent on IPoIB module removal Hal> (an improvement). AH busy on mthca module removal is still Hal> regular. Thanks for pointing this out; it was the kick in the rear I needed to really investigate this. It turns out there were two bugs (I think). In any case my logs are clean with these changes. - R.
Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1212) +++ infiniband/core/sa_query.c (working copy) @@ -632,7 +632,7 @@ } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); -static void send_handler(struct ib_mad_agent *mad_agent, +static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { struct ib_sa_query *query; @@ -660,6 +660,12 @@ break; } + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + PCI_DMA_TODEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + query->release(query); spin_lock_irqsave(&idr_lock, flags); Index: infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 1217) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -210,7 +210,7 @@ { struct ipoib_dev_priv *priv = netdev_priv(dev); - if (priv->qp != NULL) { + if (priv->qp) { if (ib_destroy_qp(priv->qp)) ipoib_warn(priv, "ib_qp_destroy failed\n"); Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1212) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -306,10 +306,6 @@ if (status) goto err; - ah = kmalloc(sizeof *ah, GFP_KERNEL); - if (!ah) - goto err; - { struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), @@ -320,13 +316,11 @@ .port_num = priv->port }; - ah->ah = ib_create_ah(priv->pd, &av); + ah = ipoib_create_ah(skb->dev, priv->pd, &av); } - if (IS_ERR(ah->ah)) { - kfree(ah); + if (!ah) goto err; - } *(struct ipoib_ah **) skb->cb = ah; @@ -459,13 +453,17 @@ return 0; } - if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", skb->dst ? "neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + return 0; + } /* put the pseudoheader back on */ skb_push(skb, sizeof *phdr); Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1216) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -48,7 +48,8 @@ if (IS_ERR(ah->ah)) { kfree(ah); ah = NULL; - } + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); return ah; } @@ -61,7 +62,12 @@ unsigned long flags; spin_lock_irqsave(&priv->lock, flags); - list_add_tail(&ah->list, &priv->dead_ahs); + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else + list_add_tail(&ah->list, &priv->dead_ahs); spin_unlock_irqrestore(&priv->lock, flags); } From halr at voltaire.com Thu Nov 11 20:54:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 23:54:34 -0500 Subject: [openib-general] IPv6 MGID formation question Message-ID: <1100235273.3369.65.camel@localhost.localdomain> It looks to me like the MGIDs for IPv6 are only getting the low 32 bits of the address rather than 80 bits. 
|   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
+--------+----+----+-----------------+---------+-------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID     |
+--------+----+----+-----------------+---------+-------------------+
Local interface address: inet6 addr: fe80::208:f104:396:71/64 Scope:Link MGID is displayed as MGID ff12:601b:ffff:0:0:1:ff96:71 (Haven't looked on the IB wire yet). IPv6 comes up in IPv6 over IPv4 tunneling mode but I don't think this should affect the MGID used. I have the latest bits (and patches) installed. -- Hal From roland at topspin.com Thu Nov 11 21:18:10 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 21:18:10 -0800 Subject: [openib-general] Re: IPv6 MGID formation question In-Reply-To: <1100235273.3369.65.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 23:54:34 -0500") References: <1100235273.3369.65.camel@localhost.localdomain> Message-ID: <528y97tzl9.fsf@topspin.com> Hal> It looks to me like the MGIDs for IPv6 are only getting the Hal> low 32 bits of the address rather than 80 bits. I think your setup is fine: Hal> Local interface address: inet6 addr: fe80::208:f104:396:71/64 The IPv6 solicited-node multicast address corresponding to this address is ff02:0:0:0:0:1:ff96:71. The ND code will join this group when the interface is brought up. Hal> MGID is displayed as MGID ff12:601b:ffff:0:0:1:ff96:71 This is the correct MGID for that solicited-node address. (If you're actually sending to other IPv6 multicast addresses and getting the wrong MGID then something is screwy. It's hard to think of what could be wrong with our ipv6_ib_mc_map() function though...) - R. From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 21:20:28 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 21:20:28 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52ekj0z93b.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52ekj0z93b.fsf@topspin.com> Message-ID: <1100236828.3722.560.camel@trinity> On Thu, 2004-11-11 at 07:41 -0800, Roland Dreier wrote: > Matt> As some of you may have noticed, we migrated over to the > Matt> new OpenIB web pages yesterday. The FAQ and a few other > Matt> items are still a work in progress. Let me know if there > Matt> are any errors or if folks have other feedback/suggestions. > > Looks great. One suggestion: under news, it's probably worth linking > to or mentioning the PathForward funding announcement. > I didn't have time to add this before another day of SC04 started. The PathForward announcements are now links under News. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 21:33:19 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 21:33:19 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52fz3gxoha.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> Message-ID: <1100237599.14334.564.camel@trinity> On Thu, 2004-11-11 at 09:52 -0800, Roland Dreier wrote: > Matt> The FAQ and a few other items are still a work in progress. > > A couple of suggestions for the FAQ: > > in "How do I submit source code patches?" > > I suggest adding something like "Please make sure that patches are > licensed under the same terms as the original code (dual GPL/BSD > for most of the OpenIB stack)." > > in "What version of the Linux kernel do you support?" > > I suggest changing the answer to something like OpenIB > supports the latest 2.6 kernel (currently 2.6.9).
> > in "What are all these upper layer protocols like IPoIB, DAPL, MPI, SDP, > SRP, and others?" > > add a link to the IETF ipoib WG at > Done. Thanks. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 21:33:48 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 21:33:48 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100196877.3283.120.camel@localhost.localdomain> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> Message-ID: <1100237628.14336.566.camel@trinity> On Thu, 2004-11-11 at 13:14 -0500, Hal Rosenstock wrote: > On Thu, 2004-11-11 at 12:52, Roland Dreier wrote: > > in "What version of the Linux kernel do you support?" > > > > I suggest changing the answer to something like OpenIB > > supports the latest 2.6 kernel (currently 2.6.9). > > Not indicating the current version (2.6.9) makes for less frequent web > page updates. Is just saying latest 2.6 kernel sufficient ? > I don't mind keeping it updated. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 22:03:07 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 22:03:07 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <20041111184715.GE32218@cup.hp.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> <20041111184715.GE32218@cup.hp.com> Message-ID: <1100239387.3722.581.camel@trinity> On Thu, 2004-11-11 at 10:47 -0800, Grant Grundler wrote: > On Thu, Nov 11, 2004 at 01:14:37PM -0500, Hal Rosenstock wrote: > > Not indicating the current version (2.6.9) makes for less frequent web > > page updates. Is just saying latest 2.6 kernel sufficient ? > > Probably not since SLES9-ia64 is based on 2.6.5 and it won't work as-is. > Making ithe FAQ a wiki (tduffy) is a good idea. > FAQ wiki does sound good. I'll look into it. - Matt From itoumsn at nttdata.co.jp Thu Nov 11 22:14:21 2004 From: itoumsn at nttdata.co.jp (Masanori ITOH) Date: Fri, 12 Nov 2004 15:14:21 +0900 (JST) Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Message-ID: <20041112.151421.120503395.itoumsn@nttdata.co.jp> Hello folks, As I mentioned fomerly on this list, I have a working u/kDAPL on top of the gen1 stack and I've finally finished all internal procedures to make it public. # Actually, it took me about one month and a half. Sigh... :( I would like to put that into the OpenIB contributors area (Somewhere like 'https://openib.org/svn/trunk/contrib/nttdata/'.), and could anyone tell me how I can do that? Thanks in advance, Masanori --- Masanori ITOH Open Source Software Development Center, NTT DATA CORPORATION e-mail: itoumsn at nttdata.co.jp phone : +81-3-3523-8122 (ext. 172-7199) From halr at voltaire.com Fri Nov 12 06:55:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 09:55:42 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <419414C2.4090300@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> Message-ID: <1100271340.6671.1.camel@hpc-1> On Thu, 2004-11-11 at 20:41, Sean Hefty wrote: > This patch recovers from send queue errors on QP 0/1. (It should also "work" in the case > of fatal errors, but does not try to recover.) Code was tested by forcing send errors and > checking that the port could still go to active. > > Patch can be applied separately from patch to mthca, but requires other patch to work > properly. 
I am having difficulty applying this patch. For some reason, all the changes are rejected. Could this be a patch version issue ? My version of patch is 2.5.4. Should I upgrade and try ? -- Hal From Nitin.Hande at Sun.COM Fri Nov 12 07:44:04 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Fri, 12 Nov 2004 07:44:04 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52lld7u8ax.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> <52u0rvucim.fsf@topspin.com> <1100223653.24741.3.camel@duffman> <52lld7u8ax.fsf@topspin.com> Message-ID: <4194DA44.5060003@Sun.COM> Roland Dreier wrote: > Tom> Would you really bring both interfaces up? If this is a > Tom> problem, the spec should have the pkey be part of the link > Tom> local address. > > It actually seems to work fine to bring up multiple IPv6 interfaces > that end up with the same link local address (like ib0 and ib0.8001). > The fact that Linux forces you to specify an interface when using a > link local address comes to the rescue. And I can't think of any > issues, since different partitions are really pretty disjoint. Btw, on VLANs, I know that the VLAN IDs are a part of the link-local addresses. That way all the link-local addresses are unique, and as a result they join different solicited-node multicast groups during the DAD process. Thanks Nitin > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Fri Nov 12 08:05:32 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 08:05:32 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100239387.3722.581.camel@trinity> (Matt Leininger's message of "Thu, 11 Nov 2004 22:03:07 -0800") References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> <20041111184715.GE32218@cup.hp.com> <1100239387.3722.581.camel@trinity> Message-ID: <52oei3rr1v.fsf@topspin.com> Matt> FAQ wiki does sound good. I'll look into it. In general having a wiki would be great (there have been a few times in the past where I would have liked to have been able to create a quick wiki page). - R. From halr at voltaire.com Fri Nov 12 08:06:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 11:06:14 -0500 Subject: [openib-general] Re: PD dealloc and AH busy problems remain In-Reply-To: <52d5yju1ak.fsf@topspin.com> References: <1100232030.3369.50.camel@localhost.localdomain> <52d5yju1ak.fsf@topspin.com> Message-ID: <1100275574.3369.419.camel@localhost.localdomain> On Thu, 2004-11-11 at 23:41, Roland Dreier wrote: > Thanks for pointing this out; it was the kick in the rear I needed to > really investigate this. It turns out there were two bugs (I think). > In any case my logs are clean with these changes. So are mine now :-) I'll keep an eye out for this recurring, but otherwise assume this is fixed. Thanks.
-- Hal From halr at voltaire.com Fri Nov 12 08:16:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 11:16:15 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52pt2juc5u.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> <419400D4.2060900@Sun.COM> <52pt2juc5u.fsf@topspin.com> Message-ID: <1100276175.3369.440.camel@localhost.localdomain> On Thu, 2004-11-11 at 19:46, Roland Dreier wrote: > In fact I don't see how Solaris can deduce the interface from > a link local IPv6 address... I don't see how this would work either (at least for Linux): Here's my config: eth1 inet6 addr: fe80::230:48ff:fe27:212f/64 Scope:Link ib0 inet6 addr: fe80::208:f104:396:71/64 Scope:Link ip -6 route show fe80::/64 dev eth1 metric 256 mtu 1500 advmss 1440 fe80::/64 dev ib0 metric 256 mtu 2044 advmss 1984 ff00::/8 dev eth1 metric 256 mtu 1500 advmss 1440 ff00::/8 dev ib0 metric 256 mtu 2044 advmss 1984 So it looks like some help is needed to select the outgoing local interface. It's not just a routing calculation on the destination address, as it appears to be in Solaris. -- Hal From mshefty at ichips.intel.com Fri Nov 12 09:13:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:13:35 -0800 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <1100271340.6671.1.camel@hpc-1> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> Message-ID: <4194EF3F.80608@ichips.intel.com> Hal Rosenstock wrote: > On Thu, 2004-11-11 at 20:41, Sean Hefty wrote: > >>This patch recovers from send queue errors on QP 0/1. (It should also "work" in the case >>of fatal errors, but does not try to recover.) Code was tested by forcing send errors and >>checking that the port could still go to active. >> >>Patch can be applied separately from patch to mthca, but requires other patch to work >>properly. > > > I am having difficulty applying this patch. For some reason, all the > changes are rejected. Could this be a patch version issue ? My version > of patch is 2.5.4. Should I upgrade and try ? Not sure what the issue is. Let me make sure that I've pulled the latest code and resubmit the patch. - Sean From mshefty at ichips.intel.com Fri Nov 12 09:15:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:15:56 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <52hdnvu88z.fsf@topspin.com> References: <419415B2.3060907@ichips.intel.com> <52hdnvu88z.fsf@topspin.com> Message-ID: <4194EFCC.4070802@ichips.intel.com> Roland Dreier wrote: > Sean> This should transition the QP state to SQE when encountering > Sean> a send error on the CQ. There may be a better way of doing > Sean> this; I didn't spend a lot of time studying the code. > > Thanks for the patch... let me look at how I want to do this (and > probably handle transitions to ERR while I'm at it). That's fine. This was just the easiest change that I could find in order to test my mad changes.
- Sean From halr at voltaire.com Fri Nov 12 09:18:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 12:18:32 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <4194EF3F.80608@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> Message-ID: <1100279912.3369.507.camel@localhost.localdomain> On Fri, 2004-11-12 at 12:13, Sean Hefty wrote: > Not sure what the issue is. Let me make sure that I've pulled the latest code and > resubmit the patch. It looks right to me. Does it work for you ? Can you send a normal rather than unified diff ? -- Hal From mshefty at ichips.intel.com Fri Nov 12 09:21:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:21:50 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100107204.2836.36.camel@hpc-1> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> <1100107204.2836.36.camel@hpc-1> Message-ID: <4194F12E.9040205@ichips.intel.com> Hal Rosenstock wrote: > On Wed, 2004-11-10 at 11:59, Roland Dreier wrote: > >> Sean> What exactly does it mean then when process_mad returns >> Sean> success? Do any of the return bits from process_mad >> Sean> indicate that the MAD was for the HCA driver? >> >>SUCCESS means that process_mad didn't encounter any errors. If REPLY >>or CONSUMED is set then process_mad actually handled the packet. > > > I would assume that REPLY and CONSUMED are also mutually exclusive. I believe that's the case, but maybe it would make more sense if they weren't, and let CONSUMED indicate that MAD was for the HCA driver. From an API perspective, I think we only need to know if the HCA driver intercepted the MAD, and if so, was a reply generated. - Sean From roland at topspin.com Fri Nov 12 09:41:55 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 09:41:55 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <419415B2.3060907@ichips.intel.com> (Sean Hefty's message of "Thu, 11 Nov 2004 17:45:22 -0800") References: <419415B2.3060907@ichips.intel.com> Message-ID: <52bre3rml8.fsf@topspin.com> I thought about this a little, and it seems that having the CQ poll operation update the QP state is not the right solution. It seems it would be better to add support for the "Current QP state" modifier for the modify QP operation and expect the consumer to use that to indicate that the QP is in SQE state. - R. From mshefty at ichips.intel.com Fri Nov 12 09:52:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:52:06 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <52bre3rml8.fsf@topspin.com> References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> Message-ID: <4194F846.5030703@ichips.intel.com> Roland Dreier wrote: > I thought about this a little, and it seems that having the CQ poll > operation update the QP state is not the right solution. 
It seems it > would be better to add support for the "Current QP state" modifier for > the modify QP operation and expect the consumer to use that to > indicate that the QP is in SQE state. That would work fine, and be only a minor update to the MAD code. Will you be generating a patch for mthca? - Sean From mshefty at ichips.intel.com Fri Nov 12 09:54:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:54:51 -0800 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <1100279912.3369.507.camel@localhost.localdomain> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> <1100279912.3369.507.camel@localhost.localdomain> Message-ID: <20041112095451.206ce08c.mshefty@ichips.intel.com> On Fri, 12 Nov 2004 12:18:32 -0500 Hal Rosenstock wrote: > On Fri, 2004-11-12 at 12:13, Sean Hefty wrote: > > Not sure what the issue is. Let me make sure that I've pulled the latest code and > > resubmit the patch. > > It looks right to me. Does it work for you ? Can you send a normal > rather than unified diff ? Can you try this version? I'll also revert back to the original code and see if I can apply the patch. - Sean Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1221) +++ include/ib_mad.h (working copy) @@ -250,6 +250,8 @@ * @mad_agent - Specifies the associated registration to post the send to. * @send_wr - Specifies the information needed to send the MAD(s). * @bad_send_wr - Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. */ int ib_post_send_mad(struct ib_mad_agent *mad_agent, struct ib_send_wr *send_wr, Index: core/mad.c =================================================================== --- core/mad.c (revision 1221) +++ core/mad.c (working copy) @@ -90,6 +90,8 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port. */ @@ -591,6 +593,7 @@ /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr.
ud.timeout_ms); + mad_send_wr->retry = 0; /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -1339,6 +1342,70 @@ } } +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) { + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup. + */ + return; + } + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests. + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send. */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + /* Transition QP to RTS and fail offending send.
*/ + ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - unable to " + "transition QP to RTS : %d\n", ret); + ib_mad_send_done_handler(port_priv, wc); + mark_sends_for_retry(qp_info); + } +} + /* * IB MAD completion callback */ @@ -1346,34 +1413,25 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; - struct ib_mad_list_head *mad_list; - struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { - if (wc.status != IB_WC_SUCCESS) { - /* Determine if failure was a send or receive */ - mad_list = (struct ib_mad_list_head *) - (unsigned long)wc.wr_id; - qp_info = mad_list->mad_queue->qp_info; - if (mad_list->mad_queue == &qp_info->send_queue) - wc.opcode = IB_WC_SEND; - else - wc.opcode = IB_WC_RECV; - } - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - BUG_ON(1); - break; - } + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); } } @@ -1717,7 +1775,8 @@ /* * Modify QP into Ready-To-Send state */ -static inline int ib_mad_change_qp_state_to_rts(struct ib_qp *qp) +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state) { int ret; struct ib_qp_attr *attr; @@ -1729,11 +1788,12 @@ "ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RTS; - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; - + attr_mask = IB_QP_STATE; + if (cur_state == IB_QPS_RTR) { + attr->sq_psn = IB_MAD_SEND_Q_PSN; + attr_mask |= IB_QP_SQ_PSN; + } ret = ib_modify_qp(qp, attr, attr_mask); kfree(attr); @@ -1793,7 +1853,8 @@ goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, + IB_QPS_RTR); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1852,6 +1913,15 @@ } } +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + static void init_mad_queue(struct ib_mad_qp_info *qp_info, struct ib_mad_queue *mad_queue) { @@ -1884,6 +1954,8 @@ qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); if (IS_ERR(qp_info->qp)) { printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1221) +++ core/mad_priv.h (working copy) @@ -127,6 +127,7 @@ u64 wr_id; /* client WR ID */ u64 tid; unsigned long timeout; + int retry; int refcount; enum ib_wc_status status; }; From halr at voltaire.com Fri Nov 12 10:04:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 13:04:24 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <20041112095451.206ce08c.mshefty@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> <1100279912.3369.507.camel@localhost.localdomain> <20041112095451.206ce08c.mshefty@ichips.intel.com> Message-ID: <1100282664.3369.556.camel@localhost.localdomain> On Fri, 2004-11-12 at 12:54, Sean Hefty wrote: > On Fri, 12 Nov 2004 12:18:32 -0500 > Can you try this version? I'll also revert back to the original code and see if > I can apply the patch. Don't bother (if you haven't already). This patch worked. -- Hal From halr at voltaire.com Fri Nov 12 10:19:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 13:19:44 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <20041112095451.206ce08c.mshefty@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> <1100279912.3369.507.camel@localhost.localdomain> <20041112095451.206ce08c.mshefty@ichips.intel.com> Message-ID: <1100283584.3369.573.camel@localhost.localdomain> On Fri, 2004-11-12 at 12:54, Sean Hefty wrote: > Can you try this version? Thanks. Applied. -- Hal From robert.j.woodruff at intel.com Fri Nov 12 11:44:23 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 12 Nov 2004 11:44:23 -0800 Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C2D8D8@orsmsx408> Hi Masanori, Matt Leininger from Sandia controls who has access to the svn tree. You should probably contact him about providing contributions. cheers woody -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Masanori ITOH Sent: Thursday, November 11, 2004 10:14 PM To: openib-general at openib.org Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Hello folks, As I mentioned formerly on this list, I have a working u/kDAPL on top of the gen1 stack and I've finally finished all internal procedures to make it public. # Actually, it took me about one month and a half. Sigh... :( I would like to put that into the OpenIB contributors area (Somewhere like 'https://openib.org/svn/trunk/contrib/nttdata/'.), and could anyone tell me how I can do that?
Thanks in advance, Masanori --- Masanori ITOH Open Source Software Development Center, NTT DATA CORPORATION e-mail: itoumsn at nttdata.co.jp phone : +81-3-3523-8122 (ext. 172-7199) _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Fri Nov 12 11:46:50 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 11:46:50 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <4194F846.5030703@ichips.intel.com> (Sean Hefty's message of "Fri, 12 Nov 2004 09:52:06 -0800") References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> Message-ID: <52vfcargt1.fsf@topspin.com> Sean> That would work fine, and be only a minor update to the MAD Sean> code. Will you be generating a patch for mthca? Yes, eventually. (ib_verbs.h will also need an update to add the field to ib_qp_attr) - R. From mshefty at ichips.intel.com Fri Nov 12 12:39:12 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 12:39:12 -0800 Subject: [openib-general] [PATCH] Remove unneeded call in MAD code Message-ID: <20041112123912.04171c0a.mshefty@ichips.intel.com> This patch removes ib_mad_return_posted_send_mads, which isn't needed when shutting down. There cannot be any sends outstanding at this point, or clients would still exist. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1222) +++ core/mad.c (working copy) @@ -1692,21 +1692,6 @@ } /* - * Return all the posted send MADs - */ -static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) -{ - unsigned long flags; - - /* Just clear port send posted MAD list... revisit!!! */ - spin_lock_irqsave(&qp_info->send_queue.lock, flags); - INIT_LIST_HEAD(&qp_info->send_queue.list); - qp_info->send_queue.count = 0; - INIT_LIST_HEAD(&qp_info->overflow_list); - spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); -} - -/* * Modify QP into Init state */ static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) @@ -1909,7 +1894,6 @@ i); } ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); } } From halr at voltaire.com Fri Nov 12 12:58:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 15:58:50 -0500 Subject: [openib-general] Re: [PATCH] Remove unneeded call in MAD code In-Reply-To: <20041112123912.04171c0a.mshefty@ichips.intel.com> References: <20041112123912.04171c0a.mshefty@ichips.intel.com> Message-ID: <1100293130.3369.658.camel@localhost.localdomain> On Fri, 2004-11-12 at 15:39, Sean Hefty wrote: > This patch removes ib_mad_return_posted_send_mads, which isn't needed when > shutting down. There cannot be any sends outstanding at this point, or > clients would still exist. Thanks. Applied. -- Hal From mshefty at ichips.intel.com Fri Nov 12 16:45:09 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 16:45:09 -0800 Subject: [openib-general] [PATCH] collapse MAD function calls Message-ID: <20041112164509.561e90de.mshefty@ichips.intel.com> This patch collapses several function calls into one when activating the MAD QPs. This avoids repeated allocation/freeing of memory; a condensed sketch of the resulting bring-up loop is below.
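In outline (error unwinding omitted, and the function name here is just for illustration; the real hunks are in the patch below):

/*
 * Condensed sketch of the collapsed bring-up: a single ib_qp_attr
 * is allocated up front and reused for the INIT -> RTR -> RTS
 * transitions of both MAD QPs (types come from mad_priv.h).
 */
static int ib_mad_qps_to_rts(struct ib_mad_port_private *port_priv)
{
	struct ib_qp_attr *attr;
	struct ib_qp *qp;
	int i, ret = 0;

	attr = kmalloc(sizeof *attr, GFP_KERNEL);
	if (!attr)
		return -ENOMEM;

	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
		qp = port_priv->qp_info[i].qp;

		/* PKey index for QP1 is irrelevant but one is needed
		 * for the Reset to Init transition. */
		attr->qp_state = IB_QPS_INIT;
		attr->pkey_index = 0;
		attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
		ret = ib_modify_qp(qp, attr, IB_QP_STATE |
				   IB_QP_PKEY_INDEX | IB_QP_QKEY);
		if (ret)
			break;

		attr->qp_state = IB_QPS_RTR;
		ret = ib_modify_qp(qp, attr, IB_QP_STATE);
		if (ret)
			break;

		attr->qp_state = IB_QPS_RTS;
		attr->sq_psn = IB_MAD_SEND_Q_PSN;
		ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN);
		if (ret)
			break;
	}

	kfree(attr);
	return ret;
}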
I have plans to examine the QP transitions to the reset state to see if these are necessary and if a race condition exists between shutting down a port and processing a receive completion. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1222) +++ core/mad.c (working copy) @@ -90,8 +90,6 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port @@ -1396,13 +1394,21 @@ } else ib_mad_send_done_handler(port_priv, wc); } else { + struct ib_qp_attr *attr; + /* Transition QP to RTS and fail offending send */ - ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); - if (ret) - printk(KERN_ERR PFX "mad_error_handler - unable to " - "transition QP to RTS : %d\n", ret); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + ret = ib_modify_qp(qp_info->qp, attr, IB_QP_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } ib_mad_send_done_handler(port_priv, wc); - mark_sends_for_retry(qp_info); } } @@ -1692,172 +1698,51 @@ } /* - * Return all the posted send MADs - */ -static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) -{ - unsigned long flags; - - /* Just clear port send posted MAD list... revisit!!! */ - spin_lock_irqsave(&qp_info->send_queue.lock, flags); - INIT_LIST_HEAD(&qp_info->send_queue.list); - qp_info->send_queue.count = 0; - INIT_LIST_HEAD(&qp_info->overflow_list); - spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); -} - -/* - * Modify QP into Init state - */ -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_INIT; - /* - * PKey index for QP1 is irrelevant but - * one is needed for the Reset to Init transition. 
- */ - attr->pkey_index = 0; - /* QKey is 0 for QP0 */ - if (qp->qp_num == 0) - attr->qkey = 0; - else - attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Receive state - */ -static inline int ib_mad_change_qp_state_to_rtr(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_RTR; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Send state - */ -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - attr->qp_state = IB_QPS_RTS; - attr_mask = IB_QP_STATE; - if (cur_state == IB_QPS_RTR) { - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask |= IB_QP_SQ_PSN; - } - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Reset state + * Start the port */ -static inline int ib_mad_change_qp_state_to_reset(struct ib_qp *qp) +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) { - int ret; + int ret, i; struct ib_qp_attr *attr; - int attr_mask; + struct ib_qp *qp; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " - "ret = %d\n", ret); - return ret; -} - -/* - * Start the port - */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) -{ - int ret, i, ret2; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition. + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "INIT\n", i); + "INIT: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTR\n", i); + "RTR: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, - IB_QPS_RTR); + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTS\n", i); + "RTS: %d\n", i, ret); goto error; } } @@ -1865,30 +1750,28 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { printk(KERN_ERR PFX "Failed to request completion " - "notification\n"); + "notification: %d\n", ret); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive " - "requests\n"); + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); goto error; } } - return 0; + goto out; error: for (i = 0; i < IB_MAD_QPS_CORE; i++) { + attr->qp_state = IB_QPS_RESET; + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ret2 = ib_mad_change_qp_state_to_reset(port_priv-> - qp_info[i].qp); - if (ret2) { - printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " - "change QP%d state to RESET\n", i); - } } + +out: + kfree(attr); return ret; } @@ -1898,19 +1781,26 @@ static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) { int i, ret; + struct ib_qp_attr *attr; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset( - port_priv->qp_info[i].qp); - if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" - " %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, - i); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RESET; + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, + IB_QP_STATE); + if (ret) + printk(KERN_ERR PFX "ib_mad_port_stop: " + "Couldn't change %s port %d QP%d " + "state to RESET\n", + port_priv->device->name, + port_priv->port_num, i); } - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); + kfree(attr); } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); } static void qp_event_handler(struct ib_event *event, void *qp_context) From halr at voltaire.com Fri Nov 12 19:08:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 22:08:14 -0500 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <20041112164509.561e90de.mshefty@ichips.intel.com> References: <20041112164509.561e90de.mshefty@ichips.intel.com> Message-ID: <1100315294.3369.682.camel@localhost.localdomain> On Fri, 2004-11-12 at 19:45, Sean Hefty wrote: > This patch callapses several function calls into one when activating > the MAD QPs. This avoids repeated allocation/freeing of memory. > > I have plans to examine the QP transitions to the reset > state to see if these are necessary and if a race condition exists > between shutting down a port and processing a receive completion. 
This patch looks like it includes the previous patch and due to this 2 large hunks are rejected. Can you regenerate this ? -- Hal From sean.hefty at intel.com Fri Nov 12 20:04:22 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 20:04:22 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100315294.3369.682.camel@localhost.localdomain> Message-ID: >On Fri, 2004-11-12 at 19:45, Sean Hefty wrote: >> This patch callapses several function calls into one when activating >> the MAD QPs. This avoids repeated allocation/freeing of memory. >> >> I have plans to examine the QP transitions to the reset >> state to see if these are necessary and if a race condition exists >> between shutting down a port and processing a receive completion. > >This patch looks like it includes the previous patch and due to this 2 >large hunks are rejected. Can you regenerate this ? Oops, sorry about that. I'll do this as soon as I get back in touch with my systems. - Sean From roland at topspin.com Fri Nov 12 20:21:39 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 20:21:39 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <4194F846.5030703@ichips.intel.com> (Sean Hefty's message of "Fri, 12 Nov 2004 09:52:06 -0800") References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> Message-ID: <52fz3eqsz0.fsf@topspin.com> OK, here's a patch that adds support for "Current QP state" in the modify QP verb. Does this look OK? Thanks, Roland Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1223) +++ infiniband/include/ib_verbs.h (working copy) @@ -421,7 +421,8 @@ enum ib_qp_attr_mask { IB_QP_STATE = 1, - IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<1), + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), IB_QP_ACCESS_FLAGS = (1<<3), IB_QP_PKEY_INDEX = (1<<4), IB_QP_PORT = (1<<5), @@ -460,6 +461,7 @@ struct ib_qp_attr { enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; enum ib_mtu path_mtu; enum ib_mig_state path_mig_state; u32 qkey; Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 1223) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -394,13 +394,16 @@ [MLX] = IB_QP_SQ_PSN, }, .opt_param = { - [UD] = IB_QP_QKEY, - [RC] = (IB_QP_ALT_PATH | + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_MIN_RNR_TIMER | IB_QP_PATH_MIG_STATE), - [MLX] = IB_QP_QKEY, + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), } } }, @@ -410,12 +413,14 @@ [IB_QPS_RTS] = { .trans = MTHCA_TRANS_RTS2RTS, .opt_param = { - [UD] = IB_QP_QKEY, + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), [RC] = (IB_QP_ACCESS_FLAGS | IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE | IB_QP_MIN_RNR_TIMER), - [MLX] = IB_QP_QKEY, + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), } }, [IB_QPS_SQD] = { @@ -427,9 +432,36 @@ [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, [IB_QPS_RTS] = { .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } }, [IB_QPS_SQD] = { .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + 
[RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } } }, [IB_QPS_SQE] = { @@ -437,6 +469,14 @@ [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, [IB_QPS_RTS] = { .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } } }, [IB_QPS_ERR] = { @@ -490,9 +530,19 @@ u8 status; int err; - spin_lock_irq(&qp->lock); - cur_state = qp->state; - spin_unlock_irq(&qp->lock); + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } if (attr_mask & IB_QP_STATE) { if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) From halr at voltaire.com Sat Nov 13 06:39:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 13 Nov 2004 09:39:42 -0500 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <52fz3eqsz0.fsf@topspin.com> References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> <52fz3eqsz0.fsf@topspin.com> Message-ID: <1100356781.3369.692.camel@localhost.localdomain> On Fri, 2004-11-12 at 23:21, Roland Dreier wrote: > OK, here's a patch that adds support for "Current QP state" in the > modify QP verb. Does this look OK? Looks good to me. A few comments/questions relative to IBA 1.2 vol 1 table 91 (p.569-572): For SQD2SQD, path migration state is missing as is remote node address vector, . Is IB_QP_TIMEOUT local ACK timeout ? Also, does MAX_QP_RD_ATOMIC handle both local and destination ? I presume the omission of number of WQEs is intentional. -- Hal From roland at topspin.com Sat Nov 13 15:07:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 13 Nov 2004 15:07:14 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <1100356781.3369.692.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 13 Nov 2004 09:39:42 -0500") References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> <52fz3eqsz0.fsf@topspin.com> <1100356781.3369.692.camel@localhost.localdomain> Message-ID: <52y8h5pcv1.fsf@topspin.com> Hal> For SQD2SQD, path migration state is missing as is remote Hal> node address vector, . Good catch on the IB_QP_AV (actually IB_QP_PATH_MIG_STATE was there). Hal> Is IB_QP_TIMEOUT local ACK timeout ? Yes. Hal> Also, does MAX_QP_RD_ATOMIC handle both local and destination? No, actually there is also IB_QP_MAX_DEST_RD_ATOMIC. I need to audit where that's missing from my table (although no mthca RDMA support is written yet). Hal> I presume the omission of number of WQEs is intentional. Yes, I'm not planning to try and support resizing of QPs. 
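For reference, the consumer-side usage this enables looks roughly like the sketch below. The helper name is made up for illustration and error reporting is trimmed, but it mirrors what the MAD error handler wants to do when a send completes in error: push the QP back to RTS while telling the verb which state we believe the QP is currently in, rather than trusting a cached state.

	static int recover_qp_from_sqe(struct ib_qp *qp)
	{
		struct ib_qp_attr *attr;
		int ret;

		/* struct ib_qp_attr is large; allocate it rather
		   than putting it on the kernel stack */
		attr = kmalloc(sizeof *attr, GFP_KERNEL);
		if (!attr)
			return -ENOMEM;

		attr->qp_state     = IB_QPS_RTS;  /* state we want */
		attr->cur_qp_state = IB_QPS_SQE;  /* state we believe the QP is in */
		ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_CUR_STATE);

		kfree(attr);
		return ret;
	}

With IB_QP_CUR_STATE set, modify QP takes the current state from the attribute instead of the driver's cached qp->state, which is exactly the SQE/SQD ambiguity this mask bit is meant to resolve.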
Thanks, Roland From gdror at mellanox.co.il Sun Nov 14 15:15:30 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Mon, 15 Nov 2004 01:15:30 +0200 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE Message-ID: <506C3D7B14CDD411A52C00025558DED6067481F8@mtlex01.yok.mtl.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, November 12, 2004 7:42 PM > > > I thought about this a little, and it seems that having the > CQ poll operation update the QP state is not the right > solution. It seems it would be better to add support for the > "Current QP state" modifier for the modify QP operation and > expect the consumer to use that to indicate that the QP is in > SQE state. > Actually I recall adding "current QP state" as an input modifier to the modify QP verb as part of the IB 1.1 errata (if I remember correctly). The main intention is to avoid the ambiguity when a consumer moves a QP into RTS state but can't tell if the QP was in SQError/Error or SQDrain. According to the spec, current QP state should only be valid when moving QP into RTS state. Hope that helps. -Dror From roland at topspin.com Sun Nov 14 21:25:05 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 14 Nov 2004 21:25:05 -0800 Subject: [openib-general] MAD handling Message-ID: <52d5yfptu6.fsf@topspin.com> A few questions about MAD handling: - What is supposed to happen to MADs that are received and are considered "solicited" because they have a method like GetResp, but which don't match any outstanding sends? Right now it looks as if they will be silently dropped in find_mad_agent(). Unfortunately this doesn't work very well with the current user_mad.c stuff -- I post all sends with a timeout of 0 and expect userspace to register an agent to get responses. I could have user_mad.c use timeouts, but then we need to come up with a way for the timeouts to be passed up to userspace. I'd sort of prefer to let userspace handle its own timeouts, although I could be persuaded otherwise. - It looks as if the case of response DR SMPs going to the SM is not handled in smi.c. smi_check_forward_dr_smp() doesn't handle the case of hop_ptr == 0, and smi_handle_dr_smp_send() just says /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ and returns 0, which will lead to the packet being dropped. How should this be fixed? - Also, if I'm reading the code correctly, it seems that in handle_outgoing_smp, mad_priv->mad will be dispatched even if no response was generated by the call to process_mad (ie we might pass garbage to the receive handler). Thanks, Roland From halr at voltaire.com Mon Nov 15 05:42:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 08:42:30 -0500 Subject: [openib-general] MAD handling In-Reply-To: <52d5yfptu6.fsf@topspin.com> References: <52d5yfptu6.fsf@topspin.com> Message-ID: <1100526150.3369.2119.camel@localhost.localdomain> On Mon, 2004-11-15 at 00:25, Roland Dreier wrote: > A few questions about MAD handling: > > - What is supposed to happen to MADs that are received and are > considered "solicited" because they have a method like GetResp, but > which don't match any outstanding sends? Right now it looks as if > they will be silently dropped in find_mad_agent(). This issue was brought up on the list last week in a thread entitled "Solicited response with no matching send request". There seemed to be no pressing need for this at the time.
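To make the drop concrete, the logic under discussion amounts to something like the following (a hypothetical sketch, not the actual find_mad_agent() code; the structure and field names are illustrative only): a response-method MAD is matched by transaction ID against the sends still awaiting a response, and an unmatched response never reaches any recv_handler.

	/* Hypothetical sketch of the matching step: illustrative
	   names, not the real mad.c internals. */
	static struct mad_send *match_response(struct mad_agent *agent,
					       struct ib_mad *mad)
	{
		struct mad_send *send;

		/* walk the sends still waiting for a response */
		list_for_each_entry(send, &agent->wait_list, list)
			if (send->tid == mad->mad_hdr.tid)
				return send;	/* solicited: completes this send */

		return NULL;	/* no match: the response is silently dropped */
	}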
> Unfortunately this doesn't work very well with the current user_mad.c stuff -- I > post all sends with a timeout of 0 and expect userspace to register > an agent to get responses. I can work on a patch for this. One issue raised with this was not providing an unmatched response if the client cancelled the send. This means that the cancellations need to be kept around (at least for some time period). > I could have user_mad.c use timeouts, but then we need to come up > with a way for the timeouts to be passed up to userspace. I'd sort > of prefer to let userspace handle its own timeouts, although I could > be persuaded otherwise. Seems to me like the SM would/could/should be using solicited sends with timeouts. Maybe that's not the way it would be done today; it may just be a port of what is already there. > - It looks as if the case of response DR SMPs going to the SM is not > handled in smi.c. smi_check_forward_dr_smp() doesn't handle the > case of hop_ptr == 0, and smi_handle_dr_smp_send() just says > > /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ > > and returns 0, which will lead to the packet being dropped. How > should this be fixed? I will be working on SMI today/tomorrow to hopefully fix the remaining cases. > - Also, if I'm reading the code correctly, it seems that in > handle_outgoing_smp, mad_priv->mad will be dispatched even if no > response was generated by the call to process_mad (ie we might pass > garbage to the receive handler). Are you referring to the case where SUCCESS is set without CONSUMED or REPLY (after calling process_mad for a local MAD)? Is this the trap repress case (from the SM to the local SMA)? -- Hal From roland at topspin.com Mon Nov 15 07:20:37 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 07:20:37 -0800 Subject: [openib-general] MAD handling In-Reply-To: <1100526150.3369.2119.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 15 Nov 2004 08:42:30 -0500") References: <52d5yfptu6.fsf@topspin.com> <1100526150.3369.2119.camel@localhost.localdomain> Message-ID: <527jonp29m.fsf@topspin.com> Hal> Seems to me like the SM would/could/should be using solicited Hal> sends with timeouts. Maybe that's not the way it would be done Hal> today; it may just be a port of what is already there. I guess I'll extend user_mad.c to handle timeouts then. Roland> - Also, if I'm reading the code correctly, it seems that Roland> in handle_outgoing_smp, mad_priv->mad will be dispatched Roland> even if no response was generated by the call to Roland> process_mad (ie we might pass garbage to the receive Roland> handler). Hal> Are you referring to the case where SUCCESS is set without Hal> CONSUMED or REPLY (after calling process_mad for a local MAD)? Hal> Is this the trap repress case (from the SM to the local SMA)? I'm just talking about the code starting below in handle_outgoing_smp(): /* See if response is solicited and there is a recv handler */ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); if (solicited_mad(&mad_priv->mad.mad) && mad_agent_priv->agent.recv_handler) { It seems we will start passing the MAD to the recv_handler without checking that process_mad() generated a reply (indeed without checking that we even called process_mad()).
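In other words, the dispatch presumably wants to be gated on the result flags process_mad() returns, roughly as in the sketch below (illustrative only, using the IB_MAD_RESULT_* flags; the setup of the synthetic ib_wc is elided):

	/* only hand the locally generated MAD to the receive handler
	 * if process_mad() was called and actually produced a reply */
	if ((ret & IB_MAD_RESULT_SUCCESS) &&
	    (ret & IB_MAD_RESULT_REPLY) &&
	    solicited_mad(&mad_priv->mad.mad) &&
	    mad_agent_priv->agent.recv_handler) {
		/* ... build the synthetic ib_wc and recv_wc here ... */
		mad_agent_priv->agent.recv_handler(mad_agent,
						   &mad_priv->header.recv_wc);
	} else
		/* no reply was generated: free the buffer instead of
		 * passing garbage up */
		kmem_cache_free(ib_mad_cache, mad_priv);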
- Roland From halr at voltaire.com Mon Nov 15 07:22:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 10:22:44 -0500 Subject: [openib-general] MAD handling In-Reply-To: <527jonp29m.fsf@topspin.com> References: <52d5yfptu6.fsf@topspin.com> <1100526150.3369.2119.camel@localhost.localdomain> <527jonp29m.fsf@topspin.com> Message-ID: <1100532164.3369.2137.camel@localhost.localdomain> On Mon, 2004-11-15 at 10:20, Roland Dreier wrote: > Roland> - Also, if I'm reading the code correctly, it seems that > Roland> in handle_outgoing_smp, mad_priv->mad will be dispatched > Roland> even if no response was generated by the call to > Roland> process_mad (ie we might pass garbage to the receive > Roland> handler). > > Hal> Are you referring to if SUCCESS is set without CONSUMED or > Hal> REPLY (after calling process_mad for a local MAD) ? Is this > Hal> the trap repress case (from the SM to the local SMA) ? > > I'm just talking about the code starting below in handle_outgoing_smp(): > > /* See if response is solicited and there is a recv handler */ > mad_agent_priv = container_of(mad_agent, > struct ib_mad_agent_private, > agent); > if (solicited_mad(&mad_priv->mad.mad) && > mad_agent_priv->agent.recv_handler) { > > It seems we will start passing the MAD to the recv_handler without > checking that process_mad() generated a reply (indeed without checking > that we even called process_mad()). I see what you mean. There are a number of cases which should skip this and just call the send handler. I'll issue a patch for this. Thanks. -- Hal From roland at topspin.com Mon Nov 15 08:40:09 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 08:40:09 -0800 Subject: [openib-general] MAD handling In-Reply-To: <1100532164.3369.2137.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 15 Nov 2004 10:22:44 -0500") References: <52d5yfptu6.fsf@topspin.com> <1100526150.3369.2119.camel@localhost.localdomain> <527jonp29m.fsf@topspin.com> <1100532164.3369.2137.camel@localhost.localdomain> Message-ID: <523bzboyl2.fsf@topspin.com> Oh yeah, one more slight glitch in the MAD API. It turns out that if a 0-hop DR SMP is passed to ib_post_send_mad(), the client's recv_handler will be called back directly from the same context. This means that the client has to be very careful to avoid deadlocking by taking the same lock in both the send posting code and the receive handling code. I fixed up the locking in user_mad.c to handle this but we may want to think about changing the MAD code to avoid this case. - R. From roland at topspin.com Mon Nov 15 08:52:15 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 08:52:15 -0800 Subject: [openib-general] Upstream submission Message-ID: <52y8h3njgg.fsf@topspin.com> Just to focus our minds, I would like to propose that we aim to post a first version of InfiniBand patches for review to linux-kernel next Monday, November 22. The plan would be to produce a series of patches that adds the code in our gen2/trunk: the IB core, mad layer, mthca, IPoIB and user MAD modules. I believe the code we have now is good enough to be reviewed, and I don't think it's going to get much better without input from the wider Linux community. I still need to update IPoIB driver to remove the use of /proc (more on this later) and add timeout handling to user_mad.c. This work should be finished today or tomorrow. Then I'll work on some scripts to take our svn tree and turn it into a series of patches for posting to lkml. 
I'll post a preliminary patch series just to openib-general by Friday morning, and if everything looks good I'll post the same series to lkml (cc'ed to openib-general so that we get replies as well). Unfortunately we missed the 2.6.10 release train (in yesterday's announcement of 2.6.10-rc2, Linus said: "Ok, the -rc2 changes are almost as big as the -rc1 changes, and we should now calm down, so I do not want to see anything but bug-fixes until 2.6.10 is released"). Still, I think by starting code review as soon as possible, we maximize our chances at getting merged as soon as 2.6.11 opens up. Comments? Objections? Thanks, Roland From roland at topspin.com Mon Nov 15 09:09:12 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 09:09:12 -0800 Subject: [openib-general] [PATCH] umad: pass timeouts to userspace Message-ID: <52u0rrnio7.fsf@topspin.com> OK, this adds a status and a timeout_ms field to struct ib_user_mad and passes timeouts up to userspace. Seem OK? - R. Index: infiniband/include/ib_user_mad.h =================================================================== --- infiniband/include/ib_user_mad.h (revision 1223) +++ infiniband/include/ib_user_mad.h (working copy) @@ -37,6 +37,10 @@ * ib_user_mad - MAD packet * @data - Contents of MAD * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) * @qpn - Remote QP number received from/to be sent to * @qkey - Remote Q_Key to be sent with (unset on receive) * @lid - Remote lid received from/to be sent to @@ -54,6 +58,8 @@ struct ib_user_mad { __u8 data[256]; __u32 id; + __u32 status; + __u32 timeout_ms; __u32 qpn; __u32 qkey; __u16 lid; Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1231) +++ infiniband/core/user_mad.c (working copy) @@ -84,17 +84,50 @@ static void ib_umad_add_one(struct ib_device *device); static void ib_umad_remove_one(struct ib_device *device); +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + static void send_handler(struct ib_mad_agent *agent, - struct ib_mad_send_wc *mad_send_wc) + struct ib_mad_send_wc *send_wc) { + struct ib_umad_file *file = agent->context; struct ib_umad_packet *packet = - (void *) (unsigned long) mad_send_wc->wr_id; + (void *) (unsigned long) send_wc->wr_id; pci_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, PCI_DMA_TODEVICE); ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + kfree(packet); } @@ -114,6 +147,7 @@ memset(packet, 0, sizeof *packet); memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + packet->mad.status = 0; packet->mad.qpn =
cpu_to_be32(mad_recv_wc->wc->src_qp); packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); packet->mad.sl = mad_recv_wc->wc->sl; @@ -128,23 +162,9 @@ packet->mad.flow_label = 0; } - down_read(&file->agent_mutex); - for (packet->mad.id = 0; - packet->mad.id < IB_UMAD_MAX_AGENTS; - packet->mad.id++) - if (agent == file->agent[packet->mad.id]) { - spin_lock_irq(&file->recv_lock); - list_add_tail(&packet->list, &file->recv_list); - spin_unlock_irq(&file->recv_lock); - wake_up_interruptible(&file->recv_wait); - goto agent; - } + if (queue_packet(file, agent, packet)) + kfree(packet); - kfree(packet); - -agent: - up_read(&file->agent_mutex); - out: ib_free_recv_mad(mad_recv_wc); } @@ -259,6 +279,7 @@ wr.wr.ud.ah = packet->ah; wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; wr.wr_id = (unsigned long) packet; Index: docs/user_mad.txt =================================================================== --- docs/user_mad.txt (revision 1223) +++ docs/user_mad.txt (working copy) @@ -39,6 +39,10 @@ fields will be filled in with information on the received MAD. For example, the remote LID will be in mad.lid. + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + poll()/select() may be used to wait until a MAD can be read. Sending MADs From robert.j.woodruff at intel.com Mon Nov 15 09:10:13 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Nov 2004 09:10:13 -0800 Subject: [openib-general] Upstream submission Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C7E39D@orsmsx408> >Comments? Objections? >Thanks, > Roland Getting code review as early as possible is probably a good idea. woody From tduffy at sun.com Mon Nov 15 09:23:18 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 15 Nov 2004 09:23:18 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <52y8h3njgg.fsf@topspin.com> References: <52y8h3njgg.fsf@topspin.com> Message-ID: <1100539398.13150.7.camel@duffman> On Mon, 2004-11-15 at 08:52 -0800, Roland Dreier wrote: > The plan would be to produce a series of patches > that adds the code in our gen2/trunk: the IB core, mad layer, mthca, > IPoIB and user MAD modules. Is there a reason to break up into patches code in drivers/infiniband? There seem to already be 4 patches outside of drivers/infiniband: linux-2.6.9-infiniband.diff linux-2.6.9-ioctl.diff linux-2.6.9-ipoib-ipv6.diff linux-2.6.9-ipoib-multicast.diff It is not like the parts of drivers/infiniband would be accepted and others not. That would not make much sense (who would want IPoIB without any layers below it to run on). -tduffy P.S. I think submitting the patches to lkml next Monday is a great idea. From halr at voltaire.com Mon Nov 15 09:55:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 12:55:19 -0500 Subject: [openib-general] IPoIB removal issue Message-ID: <1100541319.3369.2318.camel@localhost.localdomain> Hi, The ethernet on this machine is DHCP'd. Some network glitch (I think) followed by trying to remove the ipoib modules caused the following to be displayed in the console logs. Any ideas? Thanks.
-- Hal Nov 15 10:44:28 hpc-1 network: Shutting down interface eth1: succeeded Nov 15 10:44:28 hpc-1 network: Shutting down loopback interface: succeeded Nov 15 10:44:28 hpc-1 sysctl: net.ipv4.ip_forward = 0 Nov 15 10:44:28 hpc-1 sysctl: net.ipv4.conf.default.rp_filter = 1 Nov 15 10:44:28 hpc-1 sysctl: kernel.sysrq = 0 Nov 15 10:44:28 hpc-1 sysctl: kernel.core_uses_pid = 1 Nov 15 10:44:28 hpc-1 network: Setting network parameters: succeeded Nov 15 10:44:28 hpc-1 network: Bringing up loopback interface: succeeded Nov 15 10:44:28 hpc-1 ifup: Nov 15 10:44:28 hpc-1 ifup: Determining IP information for eth0... Nov 15 10:44:34 hpc-1 ifup: failed; no link present. Check cable? Nov 15 10:44:34 hpc-1 network: Bringing up interface eth0: failed Nov 15 10:44:34 hpc-1 ifup: Nov 15 10:44:34 hpc-1 ifup: Determining IP information for eth1... Nov 15 10:44:34 hpc-1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex Nov 15 10:44:34 hpc-1 dhclient: ib1: unknown hardware address type 32 Nov 15 10:44:34 hpc-1 dhclient: sit0: unknown hardware address type 776 Nov 15 10:44:34 hpc-1 dhclient: ib0: unknown hardware address type 32 Nov 15 10:44:35 hpc-1 dhclient: ib1: unknown hardware address type 32 Nov 15 10:44:35 hpc-1 dhclient: sit0: unknown hardware address type 776 Nov 15 10:44:35 hpc-1 dhclient: ib0: unknown hardware address type 32 Nov 15 10:44:36 hpc-1 dhclient: DHCPREQUEST on eth1 to 255.255.255.255 port 67 Nov 15 10:44:36 hpc-1 dhclient: DHCPACK from 10.0.2.1 Nov 15 10:44:36 hpc-1 dhclient: bound to 10.0.2.4 -- renewal in 1463 seconds. Nov 15 10:44:36 hpc-1 ifup: done. Nov 15 10:44:36 hpc-1 network: Bringing up interface eth1: succeeded Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:601b:ffff:0:0:1:ff96:71 Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:601b:ffff:0:0:0:0:1 Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:401b:ffff:0:0:0:0:1 Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:401b:ffff:0:0:0:ffff:ffff Message from syslogd at hpc-1 at Mon Nov 15 11:07:10 2004 ... hpc-1 kernel: unregister_netdevice: waiting for ib0 to become free. Usage count = 1 Nov 15 11:07:10 hpc-1 kernel: unregister_netdevice: waiting for ib0 to become free. Usage count = 1 Message from syslogd at hpc-1 at Mon Nov 15 11:07:50 2004 ... hpc-1 last message repeated 4 times Nov 15 11:07:50 hpc-1 last message repeated 4 times From roland at topspin.com Mon Nov 15 10:11:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 10:11:38 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <1100539398.13150.7.camel@duffman> (Tom Duffy's message of "Mon, 15 Nov 2004 09:23:18 -0800") References: <52y8h3njgg.fsf@topspin.com> <1100539398.13150.7.camel@duffman> Message-ID: <52hdnrnfs5.fsf@topspin.com> Tom> Is there a reason to break up into patches code in Tom> drivers/infiniband? I think so: ease of review. A single 15000 line patch is not going to be very readable. Breaking it up into multiple pieces makes the architecture a little clearer and also helps naturally organize the replies into multiple threads. - R. From roland at topspin.com Mon Nov 15 10:14:31 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 10:14:31 -0800 Subject: [openib-general] Re: IPoIB removal issue In-Reply-To: <1100541319.3369.2318.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 15 Nov 2004 12:55:19 -0500") References: <1100541319.3369.2318.camel@localhost.localdomain> Message-ID: <52d5yfnfnc.fsf@topspin.com> Hal> unregister_netdevice: waiting for ib0 to become free. 
Usage count = 1 Someone is still holding a reference to the ib0 device. I don't see anything in the IPoIB code that could be doing it, so it seems like someone outside the driver must be doing it. - R. From roland at topspin.com Mon Nov 15 10:16:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 10:16:53 -0800 Subject: [openib-general] Signed-off-by: lines In-Reply-To: <52y8h3njgg.fsf@topspin.com> (Roland Dreier's message of "Mon, 15 Nov 2004 08:52:15 -0800") References: <52y8h3njgg.fsf@topspin.com> Message-ID: <526547nfje.fsf@topspin.com> By the way, for our initial submission upstream, I am planning on submitting all the patches with my own Signed-off-by: Roland Dreier line, of course preserving any other Signed-off-by: lines that already exist. However, for the future, it would be a good idea to make sure that all patches come with a properly formatted Signed-off-by: line(s) and preserve all such lines in the svn commit messages. (Read Documentation/SubmittingPatches in the kernel tree for full details) Thanks, Roland From mshefty at ichips.intel.com Mon Nov 15 10:29:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 10:29:18 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100315294.3369.682.camel@localhost.localdomain> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> Message-ID: <20041115102918.29e7dcdb.mshefty@ichips.intel.com> On Fri, 12 Nov 2004 22:08:14 -0500 Hal Rosenstock wrote: > This patch looks like it includes the previous patch and due to this 2 > large hunks are rejected. Can you regenerate this ? Updated patch. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1232) +++ core/mad.c (working copy) @@ -90,8 +90,6 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port @@ -1397,13 +1395,21 @@ } else ib_mad_send_done_handler(port_priv, wc); } else { + struct ib_qp_attr *attr; + /* Transition QP to RTS and fail offending send */ - ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); - if (ret) - printk(KERN_ERR PFX "mad_error_handler - unable to " - "transition QP to RTS : %d\n", ret); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + ret = ib_modify_qp(qp_info->qp, attr, IB_QP_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } ib_mad_send_done_handler(port_priv, wc); - mark_sends_for_retry(qp_info); } } @@ -1699,151 +1705,45 @@ { int ret; struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_INIT; - /* - * PKey index for QP1 is irrelevant but - * one is needed for the Reset to Init transition. 
- */ - attr->pkey_index = 0; - /* QKey is 0 for QP0 */ - if (qp->qp_num == 0) - attr->qkey = 0; - else - attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Receive state - */ -static inline int ib_mad_change_qp_state_to_rtr(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_RTR; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Send state - */ -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - attr->qp_state = IB_QPS_RTS; - attr_mask = IB_QP_STATE; - if (cur_state == IB_QPS_RTR) { - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask |= IB_QP_SQ_PSN; - } - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Reset state - */ -static inline int ib_mad_change_qp_state_to_reset(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; + struct ib_qp *qp; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " - "ret = %d\n", ret); - return ret; -} - -/* - * Start the port - */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) -{ - int ret, i, ret2; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition. + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "INIT\n", i); + "INIT: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTR\n", i); + "RTR: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, - IB_QPS_RTR); + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTS\n", i); + "RTS: %d\n", i, ret); goto error; } } @@ -1851,30 +1751,28 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { printk(KERN_ERR PFX "Failed to request completion " - "notification\n"); + "notification: %d\n", ret); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive " - "requests\n"); + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); goto error; } } - return 0; + goto out; error: for (i = 0; i < IB_MAD_QPS_CORE; i++) { + attr->qp_state = IB_QPS_RESET; + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ret2 = ib_mad_change_qp_state_to_reset(port_priv-> - qp_info[i].qp); - if (ret2) { - printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " - "change QP%d state to RESET\n", i); - } } + +out: + kfree(attr); return ret; } @@ -1884,18 +1782,26 @@ static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) { int i, ret; + struct ib_qp_attr *attr; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset( - port_priv->qp_info[i].qp); - if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" - " %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, - i); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RESET; + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, + IB_QP_STATE); + if (ret) + printk(KERN_ERR PFX "ib_mad_port_stop: " + "Couldn't change %s port %d QP%d " + "state to RESET\n", + port_priv->device->name, + port_priv->port_num, i); } - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + kfree(attr); } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); } static void qp_event_handler(struct ib_event *event, void *qp_context) From dledford at redhat.com Mon Nov 15 10:52:26 2004 From: dledford at redhat.com (Doug Ledford) Date: Mon, 15 Nov 2004 13:52:26 -0500 Subject: [openib-general] Upstream submission In-Reply-To: <52y8h3njgg.fsf@topspin.com> References: <52y8h3njgg.fsf@topspin.com> Message-ID: <1100544746.3712.26.camel@compaq-rhel4.xsintricity.com> On Mon, 2004-11-15 at 08:52 -0800, Roland Dreier wrote: > Just to focus our minds, I would like to propose that we aim to post a > first version of InfiniBand patches for review to linux-kernel next > Monday, November 22. Boo! ;-) I'll echo the sentiment that this is a good idea. While I'm piping up I'll go ahead and introduce myself. I'm a kernel engineer for Red Hat (been here about 6 years). I've been assigned the task of helping aid integration of the OpenIB work into our products. 
Obviously, upstream inclusion makes my task easier, so that's certainly welcome. I've also been assigned the task of assisting with ongoing development efforts. It'll be a little bit before I'm up to speed on things and able to effectively contribute (well, that and I'm going to have to line up some test hardware). I started out by subscribing to this list and lurking in the background. Been here about 3 weeks now. Next I'm planning on doing what's necessary to get access to the current IB specs and downloading the current code base and starting to familiarize myself with the spec and the current state of the code base. By the time I've made myself familiar with things I will have hopefully worked out the hardware issue and be able to get to work. Suggestions for items I can read, web sites I should visit in order to help get me up to speed, etc. welcomed. I'm sure you'll hear more from me in the future ;-) -- Doug Ledford Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From roland at topspin.com Mon Nov 15 11:15:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 11:15:41 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <1100544746.3712.26.camel@compaq-rhel4.xsintricity.com> (Doug Ledford's message of "Mon, 15 Nov 2004 13:52:26 -0500") References: <52y8h3njgg.fsf@topspin.com> <1100544746.3712.26.camel@compaq-rhel4.xsintricity.com> Message-ID: <52wtwmncte.fsf@topspin.com> Doug> Suggestions for items I can read, web sites I should visit Doug> in order to help get me up to speed, etc. welcomed. Doug, First off, welcome! Unfortunately there's not much to read about InfiniBand beyond the current IB spec. However, I think chapter 3 is actually quite a nice introduction. As far as our current codebase goes, documentation there is pretty sparse. However, the latest tree (svn at https://openib.org/svn/gen2/trunk/src/linux-kernel) should be reasonably understandable, since we've chopped the code down to a minimum. In any case please ask if there's anything that needs clarification. (By the way, I think we're working with Brian Stevens to get your hardware situation sorted out) - Roland From halr at voltaire.com Mon Nov 15 11:50:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 14:50:06 -0500 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <20041115102918.29e7dcdb.mshefty@ichips.intel.com> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> Message-ID: <1100548206.2767.1.camel@hpc-1> On Mon, 2004-11-15 at 13:29, Sean Hefty wrote: > On Fri, 12 Nov 2004 22:08:14 -0500 > Hal Rosenstock wrote: > > This patch looks like it includes the previous patch and due to this 2 > > large hunks are rejected. Can you regenerate this ? > > Updated patch. Patch now applies but I get the following compile errors: drivers/infiniband/core/mad.c: In function `ib_mad_change_qp_state_to_init': drivers/infiniband/core/mad.c:1708: warning: declaration of `qp' shadows a parameter drivers/infiniband/core/mad.c:1716: `i' undeclared (first use in this function) drivers/infiniband/core/mad.c:1716: (Each undeclared identifier is reported only once drivers/infiniband/core/mad.c:1716: for each function it appears in.) 
drivers/infiniband/core/mad.c:1717: `port_priv' undeclared (first use in this function) drivers/infiniband/core/mad.c: In function `ib_mad_port_open': drivers/infiniband/core/mad.c:1944: warning: implicit declaration of function `ib_mad_port_start' From mshefty at ichips.intel.com Mon Nov 15 11:48:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 11:48:23 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100548206.2767.1.camel@hpc-1> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> <1100548206.2767.1.camel@hpc-1> Message-ID: <41990807.8030604@ichips.intel.com> Hal Rosenstock wrote: > Patch now applies but I get the following compile errors: > > drivers/infiniband/core/mad.c: In function > `ib_mad_change_qp_state_to_init': > drivers/infiniband/core/mad.c:1708: warning: declaration of `qp' shadows > a parameter > drivers/infiniband/core/mad.c:1716: `i' undeclared (first use in this > function) > drivers/infiniband/core/mad.c:1716: (Each undeclared identifier is > reported only once > drivers/infiniband/core/mad.c:1716: for each function it appears in.) > drivers/infiniband/core/mad.c:1717: `port_priv' undeclared (first use in > this function) > drivers/infiniband/core/mad.c: In function `ib_mad_port_open': > drivers/infiniband/core/mad.c:1944: warning: implicit declaration of > function `ib_mad_port_start' Something didn't merge right here between the two patches. I don't think this function name is even correct. Let me recheck this. - Sean From halr at voltaire.com Mon Nov 15 12:06:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 15:06:25 -0500 Subject: [openib-general] [PATCH] mad: In handle_outgoing_smp, only match response if generated Message-ID: <1100549185.2767.11.camel@hpc-1> mad: In handle_outgoing_smp, only match response if generated (based on comment from Roland) Index: mad.c =================================================================== --- mad.c (revision 1230) +++ mad.c (working copy) @@ -394,6 +394,10 @@ goto error1; } + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + if (mad_agent->device->process_mad) { ret = mad_agent->device->process_mad( mad_agent->device, @@ -418,46 +422,50 @@ mad_priv); goto error1; } - } - } - } - /* See if response is solicited and there is a recv handler */ - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; - /* - * Defined behavior is to complete response - * before request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = 0; /* IB_QPT_SMI ? */ - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = + /* + * Defined behavior is to + * complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler( - mad_agent, - &mad_priv->header.recv_wc); - } else - kmem_cache_free(ib_mad_cache, mad_priv); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = + &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler( + mad_agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } if (mad_agent_priv->agent.send_handler) { /* Now, complete send */ From paul.baxter at dsl.pipex.com Mon Nov 15 12:04:03 2004 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Mon, 15 Nov 2004 20:04:03 -0000 Subject: [openib-general] Upstream submission References: <52y8h3njgg.fsf@topspin.com> <1100539398.13150.7.camel@duffman> <52hdnrnfs5.fsf@topspin.com> Message-ID: <009601c4cb4e$41eb16f0$8000000a@blorp> While I am delighted that the lower layers are sufficiently stable to warrant being considered for code review/inclusion in the kernel, I am slightly surprised. Has the code been used in anger enough? There seem to be a lot of bugs still being discovered daily. Wouldn't having at least a preliminary set of user capabilities help assessment of the low level code and allow a wider set of people to evaluate it? Are there sufficient test tools and documentation to allow an IB novice (kernel expert) to evaluate the offering? Perhaps over the next month Doug could be a 'dry run guinea pig' for kernel inclusion and highlight the documentation and coding areas of difficulty prior to submission for a wider audience. I am concerned that an overly changeable or buggy submission may do more harm than good. Good luck with it though as I'm really looking forward to the fruits of this development. Just an opinion Paul ----- Original Message ----- From: "Roland Dreier" To: "Tom Duffy" Cc: <> Sent: Monday, November 15, 2004 6:11 PM Subject: Re: [openib-general] Upstream submission > Tom> Is there a reason to break up into patches code in > Tom> drivers/infiniband? > > I think so: ease of review. A single 15000 line patch is not going to > be very readable. Breaking it up into multiple pieces makes the > architecture a little clearer and also helps naturally organize the > replies into multiple threads. > > - R.
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Mon Nov 15 12:04:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 12:04:11 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100548206.2767.1.camel@hpc-1> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> <1100548206.2767.1.camel@hpc-1> Message-ID: <20041115120411.7ae02766.mshefty@ichips.intel.com> On Mon, 15 Nov 2004 14:50:06 -0500 Hal Rosenstock wrote: > On Mon, 2004-11-15 at 13:29, Sean Hefty wrote: > > On Fri, 12 Nov 2004 22:08:14 -0500 > > Hal Rosenstock wrote: > > > This patch looks like it includes the previous patch and due to this 2 > > > large hunks are rejected. Can you regenerate this ? > > > > Updated patch. This should fix the merge/compilation issues. Also, I re-examined the initial patch, and I don't see why it would have failed. Oh well... - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1232) +++ core/mad.c (working copy) @@ -90,8 +90,6 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port @@ -1397,13 +1395,23 @@ } else ib_mad_send_done_handler(port_priv, wc); } else { + struct ib_qp_attr *attr; + /* Transition QP to RTS and fail offending send */ - ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); - if (ret) - printk(KERN_ERR PFX "mad_error_handler - unable to " - "transition QP to RTS : %d\n", ret); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } ib_mad_send_done_handler(port_priv, wc); - mark_sends_for_retry(qp_info); } } @@ -1693,157 +1701,51 @@ } /* - * Modify QP into Init state - */ -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_INIT; - /* - * PKey index for QP1 is irrelevant but - * one is needed for the Reset to Init transition. - */ - attr->pkey_index = 0; - /* QKey is 0 for QP0 */ - if (qp->qp_num == 0) - attr->qkey = 0; - else - attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Receive state + * Start the port. 
*/ -static inline int ib_mad_change_qp_state_to_rtr(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_RTR; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Send state - */ -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - attr->qp_state = IB_QPS_RTS; - attr_mask = IB_QP_STATE; - if (cur_state == IB_QPS_RTR) { - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask |= IB_QP_SQ_PSN; - } - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Reset state - */ -static inline int ib_mad_change_qp_state_to_reset(struct ib_qp *qp) +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) { - int ret; + int ret, i; struct ib_qp_attr *attr; - int attr_mask; + struct ib_qp *qp; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " - "ret = %d\n", ret); - return ret; -} - -/* - * Start the port - */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) -{ - int ret, i, ret2; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition. + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "INIT\n", i); + "INIT: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTR\n", i); + "RTR: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, - IB_QPS_RTR); + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTS\n", i); + "RTS: %d\n", i, ret); goto error; } } @@ -1851,30 +1753,28 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { printk(KERN_ERR PFX "Failed to request completion " - "notification\n"); + "notification: %d\n", ret); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive " - "requests\n"); + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); goto error; } } - return 0; + goto out; error: for (i = 0; i < IB_MAD_QPS_CORE; i++) { + attr->qp_state = IB_QPS_RESET; + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ret2 = ib_mad_change_qp_state_to_reset(port_priv-> - qp_info[i].qp); - if (ret2) { - printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " - "change QP%d state to RESET\n", i); - } } + +out: + kfree(attr); return ret; } @@ -1884,18 +1784,26 @@ static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) { int i, ret; + struct ib_qp_attr *attr; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset( - port_priv->qp_info[i].qp); - if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" - " %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, - i); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RESET; + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, + IB_QP_STATE); + if (ret) + printk(KERN_ERR PFX "ib_mad_port_stop: " + "Couldn't change %s port %d QP%d " + "state to RESET\n", + port_priv->device->name, + port_priv->port_num, i); } - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + kfree(attr); } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); } static void qp_event_handler(struct ib_event *event, void *qp_context) From roland at topspin.com Mon Nov 15 12:23:26 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 12:23:26 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <009601c4cb4e$41eb16f0$8000000a@blorp> (Paul Baxter's message of "Mon, 15 Nov 2004 20:04:03 -0000") References: <52y8h3njgg.fsf@topspin.com> <1100539398.13150.7.camel@duffman> <52hdnrnfs5.fsf@topspin.com> <009601c4cb4e$41eb16f0$8000000a@blorp> Message-ID: <52oehyn9oh.fsf@topspin.com> Paul> Has the code been used in anger enough? Paul> There seem to be a lot of bugs still being discovered daily. I think in most scenarios IPoIB is quite stable. I've run many gigabytes of traffic without trouble. 
There may still be corner cases with module unloading and the like, but I think the best way to fix those is to get enough testers so that we can start to see a pattern to the problems. Paul> Wouldn't having at least a preliminary set of user Paul> capabilities help assessment of the low level code and allow Paul> a wider set of people to evaluate it. I think we're better served in starting with as small and digestible a chunk of code as possible and building on that. Getting a foot in the door and all that... Paul> Perhaps over the next month Doug could be a 'dry run guinea Paul> pig' for kernel inclusion and highlight the documentation Paul> and coding areas of difficulty prior to submission for a Paul> wider audience. I think we're really at the point where we're ready for a full lkml code review. Certainly I think we're at a level of stability and functionality that is appropriate for inclusion in an -mm kernel if not Linus's tree. The code doesn't need to be perfect before it gets in the kernel -- just good enough to be usable and benefit from the increased test coverage. I do plan on marking the InfiniBand Kconfig options as EXPERIMENTAL, so that should help set expectations. Paul> I am concerned that an overly changeable or buggy submission Paul> may do more harm than good. In the past there certainly have been submissions to lkml that were far too early. However our current tree is definitely not an embarrassment: there aren't any gross violations of coding standards, we use modern interfaces like sysfs correctly, and so on. Thanks, Roland From halr at voltaire.com Mon Nov 15 12:32:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 15:32:23 -0500 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <20041115120411.7ae02766.mshefty@ichips.intel.com> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> <1100548206.2767.1.camel@hpc-1> <20041115120411.7ae02766.mshefty@ichips.intel.com> Message-ID: <1100550743.2767.32.camel@hpc-1> On Mon, 2004-11-15 at 15:04, Sean Hefty wrote: > On Mon, 15 Nov 2004 14:50:06 -0500 > Hal Rosenstock wrote: > > > On Mon, 2004-11-15 at 13:29, Sean Hefty wrote: > > > On Fri, 12 Nov 2004 22:08:14 -0500 > > > Hal Rosenstock wrote: > > > > This patch looks like it includes the previous patch and due to this 2 > > > > large hunks are rejected. Can you regenerate this ? > > > > > > Updated patch. > > This should fix the merge/compilation issues. Thanks. Applied. > Also, I re-examined the initial patch, > and I don't see why it would have failed. Oh well... I can see some differences (other than line numbers): 28,30c28 < + attr->cur_qp_state = IB_QPS_SQE; < + ret = ib_modify_qp(qp_info->qp, attr, < + IB_QP_STATE | IB_QP_CUR_STATE); --- > + ret = ib_modify_qp(qp_info->qp, attr, IB_QP_STATE); 43,52c41,44 < @@ -1693,157 +1701,51 @@ < } < < /* < - * Modify QP into Init state < - */ < -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) < -{ < - int ret; < - struct ib_qp_attr *attr; --- > @@ -1699,151 +1705,45 @@ > { > int ret; > struct ib_qp_attr *attr; 86,87c78 < + * Start the port. 
< */ --- > - */ 148,149c139 < +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) < { --- > -{ 151,152c141 < + int ret, i; < struct ib_qp_attr *attr; --- > - struct ib_qp_attr *attr; -- Hal From robert.j.woodruff at intel.com Mon Nov 15 13:30:51 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Nov 2004 13:30:51 -0800 Subject: [openib-general] Upstream submission Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C7E9C3@orsmsx408> Doug Ledford wrote, >Boo! ;-) Ditto what Roland said, Welcome. woody From Tom.Duffy at Sun.COM Mon Nov 15 13:31:37 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Mon, 15 Nov 2004 13:31:37 -0800 Subject: [openib-general] [PATCH] fix sparse warnings in mthca Message-ID: <1100554297.13150.23.camel@duffman> Was getting warnings like: "warning: Using plain integer as NULL pointer" when sparse checking on x86_64. Signed-off-by: Tom Duffy Index: drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- drivers/infiniband/hw/mthca/mthca_doorbell.h (revision 1234) +++ drivers/infiniband/hw/mthca/mthca_doorbell.h (working copy) @@ -40,7 +40,7 @@ #define MTHCA_DECLARE_DOORBELL_LOCK(name) #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) -#define MTHCA_GET_DOORBELL_LOCK(ptr) (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) @@ -53,7 +53,7 @@ #define MTHCA_DECLARE_DOORBELL_LOCK(name) #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) -#define MTHCA_GET_DOORBELL_LOCK(ptr) (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) static inline unsigned long mthca_get_fpu(void) { -- Tom Duffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Nov 15 13:36:29 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 13:36:29 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100552937.2767.69.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> Message-ID: <4199215D.3070401@ichips.intel.com> Hal Rosenstock wrote: > After Roland's query this AM, I am looking at this some more: > > On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > >>The second case where I can see this happening is if the client canceled >>the send, and I'm not sure that we'd want to give the client an >>unmatched response in this case. > > > So do we just keep the cancel around for some time period to make sure > this doesn't occur ? If so, should cancel also have its own timeout or > should some arbitrary timeout be used to handle this case ? My personal take would be to avoid adding that complexity. E.g. a client sends a MAD with TID 5, cancels 5, sends 5, cancels 5, sends 5. A response is now received. What should the MAD layer do? I don't see issues with silently dropping any MAD that we're not ready to receive. For unsolicited MADs, I don't see a reasonable alternative. For solicited (response) MADs, I have a hard time seeing why a client would ever want an unmatched MAD, unless they're trying to duplicate MAD layer functionality higher in the stack. 
For user-mode, this may make sense, but I'm not convinced that duplicating the request-response functionality in user-mode is the best option (versus moving all of RMPP to user-mode). For the sourceforge stack, we handled this by defining "raw" MAD services that did nothing other than send/receive MADs. Clients using a raw service were responsible for performing RMPP, request/response matching, and handling timeouts. This worked, but the MAD layer still needed to route received MADs to the correct client. - Sean From roland at topspin.com Mon Nov 15 13:44:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 13:44:54 -0800 Subject: [openib-general] Re: [PATCH] fix sparse warnings in mthca In-Reply-To: <1100554297.13150.23.camel@duffman> (Tom Duffy's message of "Mon, 15 Nov 2004 13:31:37 -0800") References: <1100554297.13150.23.camel@duffman> Message-ID: <52y8h2lrc9.fsf@topspin.com> Thanks, applied. I'm cross-compiling for lots of archs but I only run sparse on i386. It's always something... ;) (Thanks for the Signed-off-by: line too) - R. From robert.j.woodruff at intel.com Mon Nov 15 13:48:05 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Nov 2004 13:48:05 -0800 Subject: [openib-general] Upstream submission Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C7EA23@orsmsx408> Paul Baxter wrote, >While I am delighted that the lower layers are sufficiently stable to >warrant being considered for code review/inclusion in the kernel, I am >slightly surprised. >Has the code been used in anger enough? I think that Roland is suggesting we submit it for review now, not inclusion. The team can then incorporate the comments from lkml before submitting it for inclusion. Given the past IBA projects where we developed a lot of code/capabilities and tested it fully before getting review by lkml only to have the code flamed to death when we did submit it, I think that sending in code early is better than later (IMO). >There seem to be a lot of bugs still being discovered daily. >Wouldn't having at least a preliminary set of user capabilities help >assessment of the low level code and allow a wider set of people to evaluate >it. I think the initial set of capabilities is the ability to run IPoIB. >Are there sufficient test tools and documentation to allow an IB novice >(kernel expert) to evaluate the offering. Initial test tools can be anything that runs on top of a network stack today. I do think that it is important to have good enough documentation for people to configure/run the stuff. I actually think that having it included in the kernel tree will make it easier for people to try it out, rather than having to check out the code from svn. >Perhaps over the next month Doug could be a 'dry run guinea pig' for kernel >inclusion and highlight the documentation and coding areas of difficulty >prior to submission for a wider audience. Good idea.
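For illustration, the ib_modify_qp() hunks at the top of this digest (and Sean's cleanup patch further down) drive the two MAD QPs through the short special-QP state ladder: INIT with a P_Key index and Q_Key, RTR with no address vector, then RTS with a starting send PSN. Below is a minimal sketch of that sequence; the function name is hypothetical, error handling is trimmed, and the real code allocates the attribute struct with kmalloc rather than on the stack:

	static int example_bring_up_mad_qp(struct ib_qp *qp)
	{
		struct ib_qp_attr attr;
		int ret;

		attr.qp_state   = IB_QPS_INIT;
		attr.pkey_index = 0;
		attr.qkey       = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
		ret = ib_modify_qp(qp, &attr,
				   IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY);
		if (ret)
			return ret;

		/* QP0/QP1 need no address vector or destination QP for RTR */
		attr.qp_state = IB_QPS_RTR;
		ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
		if (ret)
			return ret;

		attr.qp_state = IB_QPS_RTS;
		attr.sq_psn   = IB_MAD_SEND_Q_PSN;
		return ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_SQ_PSN);
	}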
From halr at voltaire.com Mon Nov 15 13:08:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 16:08:58 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <4192616C.7070905@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> Message-ID: <1100552937.2767.69.camel@hpc-1> After Roland's query this AM, I am looking at this some more: On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > The second case where I can see this happening is if the client canceled > the send, and I'm not sure that we'd want to give the client an > unmatched response in this case. So do we just keep the cancel around for some time period to make sure this doesn't occur ? If so, should cancel also have its own timeout or should some arbitrary timeout be used to handle this case ? -- Hal From halr at voltaire.com Mon Nov 15 13:56:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 16:56:06 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <4199215D.3070401@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> Message-ID: <1100555766.2767.102.camel@hpc-1> On Mon, 2004-11-15 at 16:36, Sean Hefty wrote: > Hal Rosenstock wrote: > > > After Roland's query this AM, I am looking at this some more: > > > > On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > > > >>The second case where I can see this happening is if the client canceled > >>the send, and I'm not sure that we'd want to give the client an > >>unmatched response in this case. > > > > > > So do we just keep the cancel around for some time period to make sure > > this doesn't occur ? If so, should cancel also have its own timeout or > > should some arbitrary timeout be used to handle this case ? > > My personal take would be to avoid adding that complexity. E.g. a > client sends a MAD with TID 5, cancels 5, sends 5, cancels 5, sends 5. > A response is now received. What should the MAD layer do? > I don't see issues with silently dropping any MAD that we're not ready > to receive. What does "ready to receive" mean in this context ? Does it mean there is no matching send if it is a solicited response ? > For unsolicited MADs, I don't see a reasonable alternative. Not sure what you mean by alternative for unsolicited MADs. For unsolicited MADs, there is only the version/class/method based routing. If there is no client, the receive is dropped. > For solicited (response) MADs, I have a hard time seeing why a client > would ever want an unmatched MAD, unless they're trying to duplicate MAD > layer functionality higher in the stack. Yes, I too would view this as a duplication of MAD layer services. > For user-mode, this may make sense, but I'm not convinced that duplicating the request-response > functionality in user-mode is the best option (versus moving all of RMPP > to user-mode). What functionality are you referring to being duplicated ? Request/response matching with timeouts ? Wouldn't moving RMPP to user mode be a duplication ? There are certain things in the kernel that might want to use RMPP. > For the sourceforge stack, we handled this by defining "raw" MAD > services that did nothing other than send/receive MADs. Clients using a > raw service were responsible for performing RMPP, request/response > matching, and handling timeouts. 
This worked, but the MAD layer still > needed to route received MADs to the correct client. Yes, but there are two types of routing: TID based and version/class/method based. -- Hal From halr at voltaire.com Mon Nov 15 13:17:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 16:17:39 -0500 Subject: [openib-general] OpenIB BuiltIn Support ? Message-ID: <1100553459.2767.80.camel@hpc-1> Hi Roland, Should IB build as either built-in or modules ? (I usually build everything as modules). If built-in should work, does everything IB need to be built in rather than as modules ? Just wondering what the expectations should be here. Thanks. -- Hal From mlleinin at hpcn.ca.sandia.gov Mon Nov 15 13:55:41 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 15 Nov 2004 13:55:41 -0800 Subject: [openib-general] Signed-off-by: lines In-Reply-To: <526547nfje.fsf@topspin.com> References: <52y8h3njgg.fsf@topspin.com> <526547nfje.fsf@topspin.com> Message-ID: <1100555741.14334.699.camel@trinity> On Mon, 2004-11-15 at 10:16 -0800, Roland Dreier wrote: > By the way, for our initial submission upstream, I am planning on > submitting all the patches with my own > > Signed-off-by: Roland Dreier > > line, of course preserving any other Signed-off-by: lines that already > exist. However, for the future, it would be a good idea to make sure > that all patches come with a properly formatted Signed-off-by: line(s) > and preserve all such lines in the svn commit messages. > > (Read Documentation/SubmittingPatches in the kernel tree for full details) > I added the "signed-off by" requirement to the OpenIB FAQ. We probably need to have an 'SVN acceptable use policy' that covers the licensing and "signed-off by" requirements. I'll put something together and put it up on openib.org for review. - Matt From roland at topspin.com Mon Nov 15 13:58:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 13:58:34 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100552937.2767.69.camel@hpc-1> (Hal Rosenstock's message of "Mon, 15 Nov 2004 16:08:58 -0500") References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> Message-ID: <52ekiulqph.fsf@topspin.com> Hal> So do we just keep the cancel around for some time period to Hal> make sure this doesn't occur ? If so, should cancel also have Hal> its own timeout or should some arbitrary timeout be used to Hal> handle this case ? I don't think we should worry about this. If a consumer sends two requests with the same TID close enough together that we can't tell which is which, that's the consumer's fault and if it breaks they should just get both pieces. - R. From roland at topspin.com Mon Nov 15 14:00:18 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 14:00:18 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <1100553459.2767.80.camel@hpc-1> (Hal Rosenstock's message of "Mon, 15 Nov 2004 16:17:39 -0500") References: <1100553459.2767.80.camel@hpc-1> Message-ID: <52actilqml.fsf@topspin.com> Hal> Should IB build as either built-in or modules ? (I usually Hal> build everything as modules). If built-in should work, does Hal> everything IB need to be built in rather than as modules ? I haven't actually tried it but I think any combination of 'y' and 'm' for config options that is allowed by the kernel config system should work. If it doesn't then it should be fairly easy to fix. - R. 
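A bit of background on why both 'y' and 'm' can work without source changes: the standard module entry points degrade gracefully when code is built in, since module_init() becomes a boot-time initcall and __exit code is simply discarded. A minimal sketch of the pattern (the names here are made up):

	#include <linux/module.h>
	#include <linux/init.h>

	static int __init example_ib_init(void)
	{
		/* runs at modprobe time when =m, during boot when =y */
		return 0;
	}

	static void __exit example_ib_exit(void)
	{
		/* discarded entirely in the built-in (=y) case */
	}

	module_init(example_ib_init);
	module_exit(example_ib_exit);
	MODULE_LICENSE("GPL");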
From peter at pantasys.com Mon Nov 15 14:11:54 2004 From: peter at pantasys.com (Peter Buckingham) Date: Mon, 15 Nov 2004 14:11:54 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <52actilqml.fsf@topspin.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> Message-ID: <419929AA.1010409@pantasys.com> Roland Dreier wrote: > Hal> Should IB build as either built-in or modules ? (I usually > Hal> build everything as modules). If built-in should work, does > Hal> everything IB need to be built in rather than as modules ? > > I haven't actually tried it but I think any combination of 'y' and 'm' > for config options that is allowed by the kernel config system should > work. If it doesn't then it should be fairly easy to fix. I have tried this with gen1 and things don't seem to play nice.. I've only tried it with mellanox's hca driver, does mthca work better when built-in? thanks, peter From mshefty at ichips.intel.com Mon Nov 15 14:15:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 14:15:50 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100555766.2767.102.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> Message-ID: <41992A96.9060000@ichips.intel.com> Hal Rosenstock wrote: >>My personal take would be to avoid adding that complexity. E.g. a >>client sends a MAD with TID 5, cancels 5, sends 5, cancels 5, sends 5. >>A response is now received. What should the MAD layer do? >>I don't see issues with silently dropping any MAD that we're not ready >>to receive. > > What does "ready to receive" mean in this context ? Does it mean there > is no matching send if it is a solicited response ? Meaning that we have a client that has requested to receive a MAD, by either asking for an unsolicited MADs via ib_register_mad_agent, or by asking for a solicited MAD by sending a request via ib_post_send_mad. >> For unsolicited MADs, I don't see a reasonable alternative. > > Not sure what you mean by alternative for unsolicited MADs. For > unsolicited MADs, there is only the version/class/method based routing. > If there is no client, the receive is dropped. Correct - the receive is dropped. There really isn't an alternative. >>For user-mode, this may make sense, but I'm not convinced that duplicating the request-response >>functionality in user-mode is the best option (versus moving all of RMPP >>to user-mode). > > > What functionality are you referring to being duplicated ? > Request/response matching with timeouts ? Wouldn't moving RMPP to user > mode be a duplication ? There are certain things in the kernel that > might want to use RMPP. I'm not suggesting relocating RMPP or request/response matching to user-mode, but I would consider duplicating those services in user-mode for user-mode clients if there was a strong enough reason. > Yes, but there are two types of routing: TID based and > version/class/method based. Correct, and we're talking mainly about TID based routing in this case. I guess my view is that I don't think that the code should trust anything that comes off the wire. If we receive a MAD that results in TID based routing that doesn't match with an existing request, then dropping it seems like the safest solution. If we want to route the MAD to the corresponding agent, however, we can do that. 
But doing this only seems useful if a client is duplicating functionality, which only makes sense to me for user-mode clients. - Sean From halr at voltaire.com Mon Nov 15 14:34:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 17:34:30 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <41992A96.9060000@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> <41992A96.9060000@ichips.intel.com> Message-ID: <1100558070.2767.119.camel@hpc-1> On Mon, 2004-11-15 at 17:15, Sean Hefty wrote: > If we want to route the MAD to the corresponding agent, however, we can > do that. But doing this only seems useful if a client is duplicating > functionality, which only makes sense to me for user-mode clients. If we want to limit this to user mode clients only, we would need an extra parameter on register to indicate whether the client was kernel or user mode. This clearly wouldn't be very trustworthy. Is there a better way ? -- Hal From tduffy at sun.com Mon Nov 15 14:33:07 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 15 Nov 2004 14:33:07 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <419929AA.1010409@pantasys.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> Message-ID: <1100557988.13150.28.camel@duffman> On Mon, 2004-11-15 at 14:11 -0800, Peter Buckingham wrote: > I have tried this with gen1 and things don't seem to play nice.. I've > only tried it with mellanox's hca driver, does mthca work better when > built-in? I just tried with the latest gen2 openib bits on 2.6.10-rc2, mthca and ipoib builtin and everything builds and boots fine (at least on x86_64). -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Mon Nov 15 14:35:37 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 14:35:37 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <419929AA.1010409@pantasys.com> (Peter Buckingham's message of "Mon, 15 Nov 2004 14:11:54 -0800") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> Message-ID: <526546lozq.fsf@topspin.com> Peter> I have tried this with gen1 and things don't seem to play Peter> nice.. I've only tried it with mellanox's hca driver, does Peter> mthca work better when built-in? Yes, I'm sure gen1 is completely broken, as is mellanox's driver. mthca should work since it uses the correct PCI driver API. I'll try building a kernel with IB built-in and report back... - R. 
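The "correct PCI driver API" referred to here is the struct pci_driver probe/remove model: the PCI core binds the driver to matching devices the same way whether the driver is modular or built in, whereas drivers that scan the bus by hand at load time tend to break in the built-in case. A rough sketch of the shape, where the names are illustrative and the device ID is just an example, not the actual mthca table:

	#include <linux/pci.h>
	#include <linux/init.h>

	static int __devinit example_probe(struct pci_dev *pdev,
					   const struct pci_device_id *id)
	{
		int err = pci_enable_device(pdev);
		if (err)
			return err;
		pci_set_master(pdev);
		/* map BARs, initialize the HCA, register with the IB core */
		return 0;
	}

	static void __devexit example_remove(struct pci_dev *pdev)
	{
		/* unregister and tear down the HCA, then: */
		pci_disable_device(pdev);
	}

	static struct pci_device_id example_ids[] = {
		{ PCI_DEVICE(0x15b3, 0x5a44) },	/* e.g. a Mellanox MT23108 */
		{ 0 }
	};

	static struct pci_driver example_driver = {
		.name     = "example_hca",
		.id_table = example_ids,
		.probe    = example_probe,
		.remove   = example_remove,
	};

The driver is then registered from its init function with pci_register_driver(&example_driver), which works identically in the modular and built-in configurations.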
From mshefty at ichips.intel.com Mon Nov 15 14:37:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 14:37:16 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100558070.2767.119.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> <41992A96.9060000@ichips.intel.com> <1100558070.2767.119.camel@hpc-1> Message-ID: <41992F9C.80900@ichips.intel.com> Hal Rosenstock wrote: >>If we want to route the MAD to the corresponding agent, however, we can >>do that. But doing this only seems useful if a client is duplicating >>functionality, which only makes sense to me for user-mode clients. > > > If we want to limit this to user mode clients only, we would need an > extra parameter on register to indicate whether the client was kernel or > user mode. This clearly wouldn't be very trustworthy. Is there a better > way ? Since the registration would actually be done in the kernel, I think that we can trust it. It's just that before supporting this, I'd like to make sure that routing unmatched responses is really the right solution. I.e. Is this something that kernel mode clients would need? Does it make sense for clients to duplicate additional functionality, such as RMPP? Would a solution that duplicated RMPP functionality in user-mode be better than one that only allowed for managing timeouts? - Sean From halr at voltaire.com Mon Nov 15 14:55:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 17:55:14 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <41992F9C.80900@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> <41992A96.9060000@ichips.intel.com> <1100558070.2767.119.camel@hpc-1> <41992F9C.80900@ichips.intel.com> Message-ID: <1100559314.2767.124.camel@hpc-1> On Mon, 2004-11-15 at 17:37, Sean Hefty wrote: > It's just that before supporting this, I'd like > to make sure that routing unmatched responses is really the right solution. > > I.e. Is this something that kernel mode clients would need? I think you mean user mode clients. > Does it > make sense for clients to duplicate additional functionality, such as > RMPP? Would a solution that duplicated RMPP functionality in user-mode > be better than one that only allowed for managing timeouts? It doesn't to me but this might be the "naive" port to get OpenSM up and running as quickly as possible. -- Hal From roland at topspin.com Mon Nov 15 14:50:39 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 14:50:39 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <1100557988.13150.28.camel@duffman> (Tom Duffy's message of "Mon, 15 Nov 2004 14:33:07 -0800") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> Message-ID: <52wtwmk9q8.fsf@topspin.com> Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, Tom> mthca and ipoib builtin and everything builds and boots fine Tom> (at least on x86_64). Cool, thanks for testing. For what it's worth, it works here on i386 as well. (Not very convenient for development though :) - R. 
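Relating the solicited-response thread above to code: the request/response matching under discussion reduces to a TID lookup on the receive path, with unmatched responses silently dropped. A minimal sketch of the idea follows; the structure and function names are hypothetical, not the actual mad.c definitions:

	#include <linux/types.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct example_sent_mad {
		struct list_head list;
		u64 tid;		/* transaction ID of the request */
		/* ... send context, timeout state ... */
	};

	/*
	 * Find and remove the outstanding request matching a received
	 * response; a NULL return means the response is discarded.
	 */
	static struct example_sent_mad *
	example_match_response(struct list_head *sends, spinlock_t *lock,
			       u64 tid)
	{
		struct example_sent_mad *sent;
		unsigned long flags;

		spin_lock_irqsave(lock, flags);
		list_for_each_entry(sent, sends, list) {
			if (sent->tid == tid) {
				list_del(&sent->list);
				spin_unlock_irqrestore(lock, flags);
				return sent;
			}
		}
		spin_unlock_irqrestore(lock, flags);
		return NULL;
	}

As Roland notes earlier in the thread, a consumer that reuses a TID while a request is still outstanding gets whatever ambiguity it has created; the MAD layer only promises sensible matching for unique TIDs.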
From peter at pantasys.com Mon Nov 15 15:22:45 2004 From: peter at pantasys.com (Peter Buckingham) Date: Mon, 15 Nov 2004 15:22:45 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <52wtwmk9q8.fsf@topspin.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> Message-ID: <41993A45.7040406@pantasys.com> Roland Dreier wrote: > Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, > Tom> mthca and ipoib builtin and everything builds and boots fine > Tom> (at least on x86_64). > > Cool, thanks for testing. For what it's worth, it works here on i386 > as well. (Not very convenient for development though :) So gen2 works. From what I understand OpenSM is not yet supported for gen2. What other things are still missing between gen1 and gen2? (sorry, this is probably a FAQ...) thanks, peter From roland at topspin.com Mon Nov 15 15:34:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 15:34:38 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <41993A45.7040406@pantasys.com> (Peter Buckingham's message of "Mon, 15 Nov 2004 15:22:45 -0800") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> <41993A45.7040406@pantasys.com> Message-ID: <52sm7ak7ox.fsf@topspin.com> Peter> So gen2 works. From what I understand OpenSM is not yet Peter> supported for gen2. What other things are still missing Peter> between gen1 and gen2? (sorry, this is probably a FAQ...) Easier to answer what works now in gen2: only IPoIB. Everything else (userspace verbs, CM, SDP, etc.) needs to be implemented or ported forward. - R. From Tom.Duffy at Sun.COM Mon Nov 15 15:35:27 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Mon, 15 Nov 2004 15:35:27 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <41993A45.7040406@pantasys.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> <41993A45.7040406@pantasys.com> Message-ID: <1100561727.13150.31.camel@duffman> On Mon, 2004-11-15 at 15:22 -0800, Peter Buckingham wrote: > Roland Dreier wrote: > > Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, > > Tom> mthca and ipoib builtin and everything builds and boots fine > > Tom> (at least on x86_64). > > > > Cool, thanks for testing. For what it's worth, it works here on i386 > > as well. (Not very convenient for development though :) > > So gen2 works. From what I understand OpenSM is not yet supported for > gen2. What other things are still missing between gen1 and gen2? (sorry, > this is probably a FAQ...) Well, all the ULP's except for IPoIB. So, for now, no SRP, SDP, *DAPL, NFSoRDMA, etc. Also, no 2.4.x kernel support. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Nov 15 15:48:26 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 15:48:26 -0800 Subject: [openib-general] [PATCH] fix cleanup in MAD code when unloading HCA driver Message-ID: <20041115154826.2162f686.mshefty@ichips.intel.com> After looking at the code, I believe that there's a race condition cleaning up in the MAD code when unloading the HCA driver. The MAD layer can be processing a received MAD when the driver unloads, which can result in accessing the receive queue after all MADs on the receive queue have been freed. This patch should correct that issue, by delaying cleanup of the receive queues until after processing completions. A similar fix is applied recovering from errors when initializing the port. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1237) +++ core/mad.c (working copy) @@ -1677,7 +1677,7 @@ /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) { struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *recv; @@ -1737,7 +1737,7 @@ if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "INIT: %d\n", i, ret); - goto error; + goto out; } attr->qp_state = IB_QPS_RTR; @@ -1745,7 +1745,7 @@ if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTR: %d\n", i, ret); - goto error; + goto out; } attr->qp_state = IB_QPS_RTS; @@ -1754,7 +1754,7 @@ if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS: %d\n", i, ret); - goto error; + goto out; } } @@ -1762,58 +1762,21 @@ if (ret) { printk(KERN_ERR PFX "Failed to request completion " "notification: %d\n", ret); - goto error; + goto out; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { printk(KERN_ERR PFX "Couldn't post receive WRs\n"); - goto error; + goto out; } } - goto out; - -error: - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - attr->qp_state = IB_QPS_RESET; - ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - } - out: kfree(attr); return ret; } -/* - * Stop the port - */ -static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) -{ - int i, ret; - struct ib_qp_attr *attr; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (attr) { - attr->qp_state = IB_QPS_RESET; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, - IB_QP_STATE); - if (ret) - printk(KERN_ERR PFX "ib_mad_port_stop: " - "Couldn't change %s port %d QP%d " - "state to RESET\n", - port_priv->device->name, - port_priv->port_num, i); - } - kfree(attr); - } - - for (i = 0; i < IB_MAD_QPS_CORE; i++) - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); -} - static void qp_event_handler(struct ib_event *event, void *qp_context) { struct ib_mad_qp_info *qp_info = qp_context; @@ -1832,21 +1795,24 @@ INIT_LIST_HEAD(&mad_queue->list); } -static int create_mad_qp(struct ib_mad_port_private *port_priv, - struct ib_mad_qp_info *qp_info, - enum ib_qp_type qp_type) +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) { - struct ib_qp_init_attr qp_init_attr; - int ret; - qp_info->port_priv = port_priv; init_mad_queue(qp_info, 
&qp_info->send_queue); init_mad_queue(qp_info, &qp_info->recv_queue); INIT_LIST_HEAD(&qp_info->overflow_list); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; @@ -1854,10 +1820,10 @@ qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; - qp_init_attr.port_num = port_priv->port_num; + qp_init_attr.port_num = qp_info->port_priv->port_num; qp_init_attr.qp_context = qp_info; qp_init_attr.event_handler = qp_event_handler; - qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); if (IS_ERR(qp_info->qp)) { printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", get_spl_qp_index(qp_type)); @@ -1903,11 +1869,13 @@ printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); return -ENOMEM; } - memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, @@ -1934,16 +1902,13 @@ goto error5; } - ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); if (ret) goto error6; - ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); if (ret) goto error7; - spin_lock_init(&port_priv->reg_lock); - INIT_LIST_HEAD(&port_priv->agent_list); - port_priv->wq = create_workqueue("ib_mad"); if (!port_priv->wq) { ret = -ENOMEM; @@ -1974,6 +1939,8 @@ ib_dealloc_pd(port_priv->pd); error4: ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); error3: kfree(port_priv); @@ -2000,7 +1967,7 @@ list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - ib_mad_port_stop(port_priv); + /* Stop processing completions. */ flush_workqueue(port_priv->wq); destroy_workqueue(port_priv->wq); destroy_mad_qp(&port_priv->qp_info[1]); @@ -2008,6 +1975,8 @@ ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); /* XXX: Handle deallocation of MAD registration tables */ kfree(port_priv); From roland at topspin.com Mon Nov 15 20:18:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 20:18:38 -0800 Subject: [openib-general] [PATCH] fix warning in mad.c Message-ID: <52oehyjujl.fsf@topspin.com> flags for spin lock should be unsigned long, not int. - R. 
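For context on the one-liner that follows: spin_lock_irqsave() saves the processor's interrupt state into the variable passed by name, and the kernel treats that state as a full unsigned long, so declaring it int can truncate the value on 64-bit architectures and draws a compiler warning. A minimal sketch of the correct pattern:

	#include <linux/spinlock.h>

	static void example_critical_section(spinlock_t *lock)
	{
		unsigned long flags;	/* must be unsigned long, not int */

		spin_lock_irqsave(lock, flags);
		/* ... touch data protected by lock ... */
		spin_unlock_irqrestore(lock, flags);
	}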
Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 1239) +++ infiniband/core/mad.c (working copy) @@ -1353,7 +1353,7 @@ { struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_list_head *mad_list; - int flags; + unsigned long flags; spin_lock_irqsave(&qp_info->send_queue.lock, flags); list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { From roland at topspin.com Mon Nov 15 20:43:06 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 20:43:06 -0800 Subject: [openib-general] [PATCH] Get rid of /proc/infiniband/ipoib_vlan Message-ID: <52k6smjtet.fsf@topspin.com> This kills off /proc/infiniband/ipoib_vlan in favor of a simpler sysfs interface. To create ib0.8001, you can now just do # echo 0x8001 > /sys/class/net/ib0/create_child and to get rid of the interface, # echo 0x8001 > /sys/class/net/ib0/delete_child (Better names for these files gladly accepted) To see a child interface's parent (in case interfaces have been renamed to something nonobvious): # cat /sys/class/net/ib0.8001/parent ib0 and to check the P_Key of an interface: # cat /sys/class/net/ib0/pkey 0xffff - Roland Index: infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- infiniband/ulp/ipoib/ipoib_vlan.c (revision 1239) +++ infiniband/ulp/ipoib/ipoib_vlan.c (working copy) @@ -32,43 +32,62 @@ #include "ipoib.h" -struct ipoib_vlan_iter { - struct list_head *pintf_cur; - struct list_head *intf_cur; -}; +/* + * We use this mutex to serialize child interface creation. This + * closes the race where userspace might create the same child + * interface twice at exactly the same time. + */ +static DECLARE_MUTEX(vlan_mutex); -static DECLARE_MUTEX(proc_mutex); +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); -int ipoib_vlan_add(struct net_device *pdev, char *intf_name, - unsigned short pkey) + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) { struct ipoib_dev_priv *ppriv, *priv; - int result = -ENOMEM; + char intf_name[IFNAMSIZ]; + int result; if (!capable(CAP_NET_ADMIN)) return -EPERM; + down(&vlan_mutex); + ppriv = netdev_priv(pdev); /* * First ensure this isn't a duplicate. We check the parent device and * then all of the child interfaces to make sure the Pkey doesn't match. 
*/ - if (ppriv->pkey == pkey) - return -ENOTUNIQ; + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } down(&ipoib_device_mutex); list_for_each_entry(priv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { up(&ipoib_device_mutex); - return -ENOTUNIQ; + result = -ENOTUNIQ; + goto err; } } up(&ipoib_device_mutex); + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); priv = ipoib_intf_alloc(intf_name); - if (!priv) - goto alloc_mem_failed; + if (!priv) { + result = -ENOMEM; + goto err; + } set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); @@ -92,19 +111,33 @@ goto register_failed; } + priv->parent = ppriv->dev; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + down(&ipoib_device_mutex); list_add_tail(&priv->list, &ppriv->child_intfs); up(&ipoib_device_mutex); + up(&vlan_mutex); return 0; +sysfs_failed: + unregister_netdev(priv->dev); + register_failed: ipoib_dev_cleanup(priv->dev); device_init_failed: free_netdev(priv->dev); -alloc_mem_failed: +err: + up(&vlan_mutex); return result; } @@ -120,13 +153,8 @@ down(&ipoib_device_mutex); list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { - if (priv->dev->flags & IFF_UP) { - up(&ipoib_device_mutex); - return -EBUSY; - } - - ipoib_dev_cleanup(priv->dev); unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); list_del(&priv->list); @@ -140,219 +168,3 @@ return -ENOENT; } - -/* =============================================================== */ -/*..ipoib_vlan_iter_next -- incr. iter. -- return non-zero at end */ -int ipoib_vlan_iter_next(struct ipoib_vlan_iter *iter) -{ - while (1) { - struct ipoib_dev_priv *priv; - - priv = list_entry(iter->pintf_cur, struct ipoib_dev_priv, list); - if (!iter->intf_cur) - iter->intf_cur = priv->child_intfs.next; - else - iter->intf_cur = iter->intf_cur->next; - - if (iter->intf_cur == &priv->child_intfs) { - iter->pintf_cur = iter->pintf_cur->next; - if (iter->pintf_cur == &ipoib_device_list) - return 1; - - iter->intf_cur = NULL; - return 0; - } else - return 0; - } -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_start -- seq file handling */ -static void *_ipoib_vlan_seq_start(struct seq_file *file, loff_t *pos) -{ - struct ipoib_vlan_iter *iter; - loff_t n = *pos; - - iter = kmalloc(sizeof(*iter), GFP_KERNEL); - if (!iter) - return NULL; - - iter->pintf_cur = ipoib_device_list.next; - iter->intf_cur = NULL; - - while (n--) { - if (ipoib_vlan_iter_next(iter)) { - kfree(iter); - return NULL; - } - } - - return iter; -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_next -- seq file handling */ -static void *_ipoib_vlan_seq_next(struct seq_file *file, void *iter_ptr, - loff_t *pos) -{ - struct ipoib_vlan_iter *iter = iter_ptr; - - (*pos)++; - - if (ipoib_vlan_iter_next(iter)) { - kfree(iter); - return NULL; - } - - return iter; -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_stop -- seq file handling */ -static void _ipoib_vlan_seq_stop(struct seq_file *file, void *iter_ptr) -{ - struct ipoib_vlan_iter *iter = iter_ptr; - - kfree(iter); -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_show -- seq file handling */ -static int _ipoib_vlan_seq_show(struct seq_file *file, void *iter_ptr) -{ - struct ipoib_vlan_iter *iter = 
iter_ptr; - - if (iter) { - struct ipoib_dev_priv *ppriv; - - ppriv = list_entry(iter->pintf_cur, struct ipoib_dev_priv, list); - - if (!iter->intf_cur) - seq_printf(file, "%s 0x%04x\n", ppriv->dev->name, - ppriv->pkey); - else { - struct ipoib_dev_priv *priv; - - priv = list_entry(iter->intf_cur, struct ipoib_dev_priv, - list); - - seq_printf(file, " %s %s 0x%04x\n", ppriv->dev->name, - priv->dev->name, priv->pkey); - } - } - - return 0; -} - -static struct seq_operations ipoib_vlan_seq_operations = { - .start = _ipoib_vlan_seq_start, - .next = _ipoib_vlan_seq_next, - .stop = _ipoib_vlan_seq_stop, - .show = _ipoib_vlan_seq_show, -}; - -/* =============================================================== */ -/*.._ipoib_vlan_proc_open -- proc file handling */ -static int _ipoib_vlan_proc_open(struct inode *inode, struct file *file) -{ - if (down_interruptible(&proc_mutex)) - return -ERESTARTSYS; - - return seq_open(file, &ipoib_vlan_seq_operations); -} - -/* =============================================================== */ -/*.._ipoib_vlan_proc_write -- proc file handling */ -static ssize_t _ipoib_vlan_proc_write(struct file *file, - const char __user *buffer, - size_t count, loff_t *pos) -{ - int result; - char kernel_buf[256]; - char intf_parent[128], intf_name[128]; - unsigned int pkey; - struct net_device *pdev; - - count = min(count, sizeof(kernel_buf)); - - if (copy_from_user(kernel_buf, buffer, count)) - return -EFAULT; - - kernel_buf[count - 1] = '\0'; - - if (sscanf(kernel_buf, "add %128s %128s %i", intf_parent, intf_name, - &pkey) == 3) { - if (pkey > 0xffff) - return -EINVAL; - - pdev = dev_get_by_name(intf_parent); - if (!pdev) - return -ENOENT; - - result = ipoib_vlan_add(pdev, intf_name, pkey); - - dev_put(pdev); - - if (result < 0) - return result; - } else if (sscanf(kernel_buf, "del %128s %i", intf_parent, - &pkey) == 2) { - if (pkey > 0xffff) - return -EINVAL; - - pdev = dev_get_by_name(intf_parent); - if (!pdev) - return -ENOENT; - - result = ipoib_vlan_delete(pdev, pkey); - - dev_put(pdev); - - if (result < 0) - return result; - } else - return -EINVAL; - - return count; -} - -/* =============================================================== */ -/*.._ipoib_vlan_proc_release -- proc file handling */ -static int _ipoib_vlan_proc_release(struct inode *inode, struct file *file) -{ - up(&proc_mutex); - - return seq_release(inode, file); -} - -static struct file_operations ipoib_vlan_proc_operations = { - .owner = THIS_MODULE, - .open = _ipoib_vlan_proc_open, - .read = seq_read, - .write = _ipoib_vlan_proc_write, - .llseek = seq_lseek, - .release = _ipoib_vlan_proc_release, -}; - -struct proc_dir_entry *vlan_proc_entry; - -int ipoib_vlan_init(void) -{ - vlan_proc_entry = create_proc_entry("ipoib_vlan", - S_IRUGO | S_IWUGO, ipoib_proc_dir); - - if (!vlan_proc_entry) { - printk(KERN_WARNING "Can't create ipoib_vlan in /proc\n"); - return -ENOMEM; - } - - vlan_proc_entry->proc_fops = &ipoib_vlan_proc_operations; - - return 0; -} - -void ipoib_vlan_cleanup(void) -{ - if (vlan_proc_entry) - remove_proc_entry("ipoib_vlan", ipoib_proc_dir); -} Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1239) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -609,7 +609,6 @@ void ipoib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; - int i; /* Delete any child interfaces first */ /* Safe since it's either protected by 
ipoib_device_mutex or empty */ @@ -626,19 +625,11 @@ ipoib_ib_dev_cleanup(dev); if (priv->rx_ring) { - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) - if (priv->rx_ring[i].skb) - dev_kfree_skb_any(priv->rx_ring[i].skb); - kfree(priv->rx_ring); priv->rx_ring = NULL; } if (priv->tx_ring) { - for (i = 0; i < IPOIB_TX_RING_SIZE; ++i) - if (priv->tx_ring[i].skb) - dev_kfree_skb_any(priv->tx_ring[i].skb); - kfree(priv->tx_ring); priv->tx_ring = NULL; } @@ -714,6 +705,60 @@ return netdev_priv(dev); } +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + static int ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { struct ipoib_dev_priv *priv; @@ -771,6 +816,15 @@ if (ipoib_proc_dev_init(priv->dev)) goto proc_failed; + if (ipoib_add_pkey_attr(priv->dev)) + goto proc_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto proc_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto proc_failed; + down(&ipoib_device_mutex); list_add_tail(&priv->list, &ipoib_device_list); up(&ipoib_device_mutex); @@ -860,8 +914,6 @@ if (ret) goto err_wq; - ipoib_vlan_init(); - return 0; err_wq: @@ -875,7 +927,6 @@ static void __exit ipoib_cleanup_module(void) { - ipoib_vlan_cleanup(); ib_unregister_client(&ipoib_client); remove_proc_entry("infiniband", NULL); destroy_workqueue(ipoib_workqueue); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1239) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -124,7 +124,6 @@ union ib_gid local_gid; u16 local_lid; - u32 local_qpn; unsigned int admin_mtu; unsigned int mcast_mtu; @@ -145,6 +144,7 @@ struct net_device_stats stats; + struct net_device *parent; struct list_head child_intfs; struct list_head list; }; @@ -186,6 +186,8 @@ kref_put(&ah->ref, ipoib_free_ah); } +int ipoib_add_pkey_attr(struct net_device *dev); + void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(void *dev_ptr); @@ -240,8 +242,8 @@ void ipoib_event(struct ib_event_handler *handler, struct ib_event *record); -int ipoib_vlan_init(void); -void ipoib_vlan_cleanup(void); +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct 
net_device *pdev, unsigned short pkey); void ipoib_pkey_poll(void *dev); int ipoib_pkey_dev_delay_open(struct net_device *dev); Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1239) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -229,7 +229,6 @@ priv->stats.tx_bytes += tx_req->skb->len; dev_kfree_skb_any(tx_req->skb); - tx_req->skb = NULL; spin_lock_irqsave(&priv->lock, flags); ++priv->tx_tail; @@ -336,7 +335,6 @@ address->ah, qpn, addr, skb->len)) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - tx_req->skb = NULL; dev_kfree_skb_any(skb); } else { unsigned long flags; @@ -485,6 +483,8 @@ while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) yield(); + ipoib_dbg(priv, "All sends and receives done.\n"); + qp_attr.qp_state = IB_QPS_RESET; attr_mask = IB_QP_STATE; if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) @@ -499,12 +499,9 @@ yield(); } - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { - if (priv->rx_ring[i].skb) { - dev_kfree_skb_any(priv->rx_ring[i].skb); - priv->rx_ring[i].skb = NULL; - } - } + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); return 0; } From roland at topspin.com Mon Nov 15 20:46:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 20:46:54 -0800 Subject: [openib-general] warning: ipoibcfg no longer needed In-Reply-To: <52k6smjtet.fsf@topspin.com> (Roland Dreier's message of "Mon, 15 Nov 2004 20:43:06 -0800") References: <52k6smjtet.fsf@topspin.com> Message-ID: <52fz3ajt8h.fsf@topspin.com> I just committed a change to IPoIB that means ipoibcfg is no longer needed (and will no longer work). See the previous message in this thread, "[PATCH] Get rid of /proc/infiniband/ipoib_vlan", for full details. - Roland From halr at voltaire.com Mon Nov 15 21:04:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 00:04:53 -0500 Subject: [openib-general] [PATCH] fix warning in mad.c In-Reply-To: <52oehyjujl.fsf@topspin.com> References: <52oehyjujl.fsf@topspin.com> Message-ID: <1100581493.3369.2393.camel@localhost.localdomain> On Mon, 2004-11-15 at 23:18, Roland Dreier wrote: > flags for spin lock should be unsigned long, not int. Thanks. Applied. -- Hal From halr at voltaire.com Mon Nov 15 21:44:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 00:44:36 -0500 Subject: [openib-general] umad doc Message-ID: <1100583875.3369.2405.camel@localhost.localdomain> Hi Roland, Should the user-mad.txt doc indicate /udev rather than /dev as follows: /udev files r.t. /dev files /udev/infiniband/mthca0/ports/1/mad r.t. /dev/infiniband/mthca0/ports/1/mad -- Hal From itoumsn at nttdata.co.jp Tue Nov 16 04:49:10 2004 From: itoumsn at nttdata.co.jp (Masanori ITOH) Date: Tue, 16 Nov 2004 21:49:10 +0900 (JST) Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA In-Reply-To: <20041112.151421.120503395.itoumsn@nttdata.co.jp> References: <20041112.151421.120503395.itoumsn@nttdata.co.jp> Message-ID: <20041116.214910.01371084.itoumsn@nttdata.co.jp> Hi folks, From: Masanori ITOH Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Date: Fri, 12 Nov 2004 15:14:21 +0900 (JST) > > Hello folks, > > As I mentioned fomerly on this list, I have a working u/kDAPL on top of > the gen1 stack and I've finally finished all internal procedures > to make it public. > # Actually, it took me about one month and a half. Sigh... 
:( > > I would like to put that into the OpenIB contributors area > (Somewhere like 'https://openib.org/svn/trunk/contrib/nttdata/'.), > and could anyone tell me how I can do that? Today, I checked in my u/kDAPL work into: https://openib.org/svn/trunk/contrib/nttdata/ A detailed readme document is included in 'ntt_dapl_1.0.tar.bz2', and I hope that my work also could be a base of the gen2 u/kDAPL. Regards, Masanori --- Masanori ITOH Open Source Software Development Center, NTT DATA CORPORATION e-mail: itoumsn at nttdata.co.jp phone : +81-3-3523-8122 (ext. 172-7199) From halr at voltaire.com Tue Nov 16 05:20:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 08:20:13 -0500 Subject: [openib-general] Re: [PATCH] fix cleanup in MAD code when unloading HCA driver In-Reply-To: <20041115154826.2162f686.mshefty@ichips.intel.com> References: <20041115154826.2162f686.mshefty@ichips.intel.com> Message-ID: <1100611212.3369.2422.camel@localhost.localdomain> On Mon, 2004-11-15 at 18:48, Sean Hefty wrote: > After looking at the code, I believe that there's a race condition > cleaning up in the MAD code when unloading the HCA driver. The > MAD layer can be processing a received MAD when the driver unloads, > which can result in accessing the receive queue after all MADs > on the receive queue have been freed. > > This patch should correct that issue, by delaying cleanup of > the receive queues until after processing completions. A > similar fix is applied recovering from errors when initializing > the port. Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 16 06:12:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 09:12:28 -0500 Subject: [openib-general] [PATCH] mad: In handle_outgoing_smp, remove unneeded call to smi_handle_dr_recv Message-ID: <1100614347.28332.3.camel@hpc-1> mad: In handle_outgoing_smp, remove unneeded call to smi_handle_dr_recv There is no need to check the DR validity on a MAD which has been processed locally Index: mad.c =================================================================== --- mad.c (revision 1244) +++ mad.c (working copy) @@ -410,17 +410,6 @@ goto error1; } if (ret & IB_MAD_RESULT_REPLY) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&mad_priv->mad, - mad_agent->device->node_type, - mad_agent->port_num, - mad_agent->device->phys_port_cnt)) { - ret = -EINVAL; - kmem_cache_free(ib_mad_cache, - mad_priv); - goto error1; - } - /* * See if response is solicited and * there is a recv handler From roland at topspin.com Tue Nov 16 08:01:19 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 08:01:19 -0800 Subject: [openib-general] Re: umad doc In-Reply-To: <1100583875.3369.2405.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 16 Nov 2004 00:44:36 -0500") References: <1100583875.3369.2405.camel@localhost.localdomain> Message-ID: <52brdxkckw.fsf@topspin.com> Hal> Hi Roland, Should the user-mad.txt doc indicate /udev rather Hal> than /dev as follows: I guess it depends on how udev is set up on your system. On my systems (running Debian sarge), udev manages /dev and there is no /udev tree. I believe this is the way things are expected to be done on a completely modern system. - R. 
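Hal's follow-up question about major/minor numbers is answered just below: they are allocated dynamically via alloc_chrdev_region(), which is why udev, rather than a static /dev entry, is the natural way to create the nodes. For reference, the dynamic allocation pattern looks roughly like the sketch below; the names and minor count are made up, not the actual umad code:

	#include <linux/module.h>
	#include <linux/fs.h>
	#include <linux/cdev.h>

	static dev_t example_base;
	static struct cdev example_cdev;

	static int example_chrdev_register(struct file_operations *fops)
	{
		/* ask the kernel for an unused major plus a range of minors */
		int ret = alloc_chrdev_region(&example_base, 0, 32,
					      "example_umad");
		if (ret)
			return ret;

		cdev_init(&example_cdev, fops);
		example_cdev.owner = THIS_MODULE;
		ret = cdev_add(&example_cdev, example_base, 32);
		if (ret)
			unregister_chrdev_region(example_base, 32);
		return ret;
	}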
From halr at voltaire.com Tue Nov 16 08:13:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 11:13:26 -0500 Subject: [openib-general] Re: umad doc In-Reply-To: <52brdxkckw.fsf@topspin.com> References: <1100583875.3369.2405.camel@localhost.localdomain> <52brdxkckw.fsf@topspin.com> Message-ID: <1100621604.27172.0.camel@hpc-1> On Tue, 2004-11-16 at 11:01, Roland Dreier wrote: > Hal> Hi Roland, Should the user-mad.txt doc indicate /udev rather > Hal> than /dev as follows: > > I guess it depends on how udev is set up on your system. On my > systems (running Debian sarge), udev manages /dev and there is no > /udev tree. I believe this is the way things are expected to be done > on a completely modern system. Guess I didn't completely modernize my machine :-) BTW, are there major and minor device numbers for the IB devices ? -- Hal From roland at topspin.com Tue Nov 16 08:09:37 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 08:09:37 -0800 Subject: [openib-general] Re: umad doc In-Reply-To: <1100621604.27172.0.camel@hpc-1> (Hal Rosenstock's message of "Tue, 16 Nov 2004 11:13:26 -0500") References: <1100583875.3369.2405.camel@localhost.localdomain> <52brdxkckw.fsf@topspin.com> <1100621604.27172.0.camel@hpc-1> Message-ID: <52y8h1ixmm.fsf@topspin.com> Hal> BTW, are there major and minor device numbers for the IB Hal> devices ? They are assigned dynamically by the call to alloc_chrdev_region(). - R. From halr at voltaire.com Tue Nov 16 08:16:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 11:16:52 -0500 Subject: [openib-general] SMI Patch Message-ID: <1100621810.27172.4.camel@hpc-1> Hi, I have a patch which I think fixes the SMI issues for an end node. There is more to be done for switch support but this hopefully is sufficient for SM support. Can you please validate it before I check it in ? I'm a little gun shy about breaking the tree after last week's debacle. If it works in your configurations, I will check it in. Thanks. -- Hal Index: smi.c =================================================================== --- smi.c (revision 1247) +++ smi.c (working copy) @@ -98,6 +98,9 @@ } /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + /* C14-13:5 -- Check for unreasonable hop pointer */ return 0; } Index: mad.c =================================================================== --- mad.c (revision 1245) +++ mad.c (working copy) @@ -1121,7 +1121,7 @@ port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) - goto out; + goto local; if (!smi_handle_dr_smp_send(smp, port_priv->device->node_type, port_priv->port_num)) @@ -1132,6 +1132,7 @@ goto out; } +local: /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { int ret; From halr at voltaire.com Tue Nov 16 08:11:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 11:11:20 -0500 Subject: [openib-general] MAD handling In-Reply-To: <52d5yfptu6.fsf@topspin.com> References: <52d5yfptu6.fsf@topspin.com> Message-ID: <1100621480.3369.2601.camel@localhost.localdomain> On Mon, 2004-11-15 at 00:25, Roland Dreier wrote: > A few questions about MAD handling: I'm re-responding with more concrete/specific answers to the questions below. > - It looks as if the case of response DR SMPs going to the SM is not > handled in smi.c.
smi_check_forward_dr_smp() doesn't handle the > case of hop_ptr == 0, and smi_handle_dr_smp_send() just says > > /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ > > and returns 0, which will lead to the packet being dropped. How > should this be fixed? I added the check for hop_ptr 0 into smi_handle_dr_smp_send. smi_check_forward_dr_smp() is correct, since for hop_ptr 0 it returns 0, which means the SMP should be completed up the stack (which wasn't being done in mad.c). -- Hal From mshefty at ichips.intel.com Tue Nov 16 09:42:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 16 Nov 2004 09:42:16 -0800 Subject: [openib-general] Re: SMI Patch In-Reply-To: <1100621810.27172.4.camel@hpc-1> References: <1100621810.27172.4.camel@hpc-1> Message-ID: <419A3BF8.3080908@ichips.intel.com> Hal Rosenstock wrote: > Hi, > > I have a patch which I think fixes the SMI issues for an end node. There > is more to be done for switch support but this hopefully is sufficient > for SM support. Can you please validate it before I check it in ? I'm a > little gun shy about breaking the tree after last week's debacle. If it > works in your configurations, I will check it in. I applied the patch to my local repository, and it worked fine. One item of note is that my test system is connected into a switched fabric with opensm running on the source forge stack. When I load the openib stack, the port goes to INIT. It doesn't go to ACTIVE until I unplug and re-insert the cable. This is true with or without this patch. (I'm connected to a 16-port Mellanox switch.) The systems running the source forge stack do not see this issue. - Sean From roland at topspin.com Tue Nov 16 09:43:18 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 09:43:18 -0800 Subject: [openib-general] Re: SMI Patch In-Reply-To: <1100621810.27172.4.camel@hpc-1> (Hal Rosenstock's message of "Tue, 16 Nov 2004 11:16:52 -0500") References: <1100621810.27172.4.camel@hpc-1> Message-ID: <52pt2ditah.fsf@topspin.com> Seems to work here as well... - R. From halr at voltaire.com Tue Nov 16 10:29:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 13:29:24 -0500 Subject: [openib-general] Re: SMI Patch In-Reply-To: <419A3BF8.3080908@ichips.intel.com> References: <1100621810.27172.4.camel@hpc-1> <419A3BF8.3080908@ichips.intel.com> Message-ID: <1100629763.27971.6.camel@hpc-1> On Tue, 2004-11-16 at 12:42, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Hi, > > > > I have a patch which I think fixes the SMI issues for an end node. There > > is more to be done for switch support but this hopefully is sufficient > > for SM support. Can you please validate it before I check it in ? I'm a > > little gun shy about breaking the tree after last week's debacle. If it > > works in your configurations, I will check it in. > > I applied the patch to my local repository, and it worked fine. Thanks for checking this out. > One item of note is that my test system is connected into a switched > fabric with opensm running on the source forge stack. When I load the > openib stack, the port goes to INIT. It doesn't go to ACTIVE until I > unplug and re-insert the cable. This is true with or without this > patch. (I'm connected to a 16-port Mellanox switch.) The systems > running the source forge stack do not see this issue. Can you see whether any packets received make it to ib_mad_recv_done_handler when the port stays in INIT ? It seems weird that a cable reinsertion would bring it back to life.
Sounds like some sort of initialization issue that gets fixed on a cable reinsertion. Is this reproducible every time or intermittent ? -- Hal > > - Sean From halr at voltaire.com Tue Nov 16 10:29:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 13:29:38 -0500 Subject: [openib-general] Re: SMI Patch In-Reply-To: <52pt2ditah.fsf@topspin.com> References: <1100621810.27172.4.camel@hpc-1> <52pt2ditah.fsf@topspin.com> Message-ID: <1100629778.27971.9.camel@hpc-1> On Tue, 2004-11-16 at 12:43, Roland Dreier wrote: > Seems to work here as well... Thanks for trying this. I will check this in shortly. -- Hal From halr at voltaire.com Tue Nov 16 10:35:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 13:35:29 -0500 Subject: [openib-general] [PATCH] SMI/MAD: Fix a couple of SMI cases Message-ID: <1100630129.27971.16.camel@hpc-1> smi/mad: In smi_handle_dr_smp_send, handle hop_ptr 0 (C14-13:4). Also, in ib_mad_recv_handler_done, 0 return from smi_check_forward_dr_smp means local rather than discard. Index: smi.c =================================================================== --- smi.c (revision 1247) +++ smi.c (working copy) @@ -98,6 +98,9 @@ } /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + /* C14-13:5 -- Check for unreasonable hop pointer */ return 0; } Index: mad.c =================================================================== --- mad.c (revision 1245) +++ mad.c (working copy) @@ -1121,7 +1121,7 @@ port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) - goto out; + goto local; if (!smi_handle_dr_smp_send(smp, port_priv->device->node_type, port_priv->port_num)) @@ -1132,6 +1132,7 @@ goto out; } +local: /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { int ret; From blist at aon.at Tue Nov 16 10:30:19 2004 From: blist at aon.at (Bernhard Fischer) Date: Tue, 16 Nov 2004 19:30:19 +0100 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP Message-ID: <20041116183019.GB1206@aon.at> Hi, from linux/spinlock.h: "spin_is_locked on UP always says FALSE" please consider applying, -------------- next part -------------- diff -x '*.diff' -rup gen2.oorig/src/linux-kernel/infiniband/core/agent.c gen2/src/linux-kernel/infiniband/core/agent.c --- gen2.oorig/src/linux-kernel/infiniband/core/agent.c 2004-11-12 16:29:26.000000000 +0100 +++ gen2/src/linux-kernel/infiniband/core/agent.c 2004-11-16 19:11:04.595949168 +0100 @@ -42,7 +42,9 @@ __ib_get_agent_port(struct ib_device *de { struct ib_agent_port_private *entry; +#if defined(CONFIG_SMP) BUG_ON(!spin_is_locked(&ib_agent_port_list_lock)); +#endif BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ if (device) { diff -x '*.diff' -rup gen2.oorig/src/linux-kernel/infiniband/core/mad.c gen2/src/linux-kernel/infiniband/core/mad.c --- gen2.oorig/src/linux-kernel/infiniband/core/mad.c 2004-11-16 17:24:36.000000000 +0100 +++ gen2/src/linux-kernel/infiniband/core/mad.c 2004-11-16 19:09:25.577038602 +0100 @@ -100,7 +100,9 @@ __ib_get_mad_port(struct ib_device *devi { struct ib_mad_port_private *entry; +#if defined(CONFIG_SMP) BUG_ON(!spin_is_locked(&ib_mad_port_list_lock)); +#endif list_for_each_entry(entry, &ib_mad_port_list, port_list) { if (entry->device == device && entry->port_num == port_num) return entry; From roland at topspin.com Tue Nov 16 10:38:41 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 10:38:41 -0800 Subject: [openib-general] [patch] 
mad.c, agent.c spinlocking on UP In-Reply-To: <20041116183019.GB1206@aon.at> (Bernhard Fischer's message of "Tue, 16 Nov 2004 19:30:19 +0100") References: <20041116183019.GB1206@aon.at> Message-ID: <52ekitiqq6.fsf@topspin.com> Bernhard> Hi, from linux/spinlock.h: "spin_is_locked on UP always Bernhard> says FALSE" Good catch. Bernhard> please consider applying, Can we try and think of a fix that doesn't involve adding #ifdefs to the source file? Do we really need the BUG_ONs at all? - R. From mshefty at ichips.intel.com Tue Nov 16 10:41:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 16 Nov 2004 10:41:18 -0800 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <52ekitiqq6.fsf@topspin.com> References: <20041116183019.GB1206@aon.at> <52ekitiqq6.fsf@topspin.com> Message-ID: <419A49CE.7000506@ichips.intel.com> Roland Dreier wrote: > Bernhard> Hi, from linux/spinlock.h: "spin_is_locked on UP always > Bernhard> says FALSE" > > Good catch. > > Bernhard> please consider applying, > > Can we try and think of a fix that doesn't involve adding #ifdefs to > the source file? Do we really need the BUG_ONs at all? I'd vote to remove the BUG_ONs, versus adding #ifdef. - Sean From roland at topspin.com Tue Nov 16 10:46:45 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 10:46:45 -0800 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <419A49CE.7000506@ichips.intel.com> (Sean Hefty's message of "Tue, 16 Nov 2004 10:41:18 -0800") References: <20041116183019.GB1206@aon.at> <52ekitiqq6.fsf@topspin.com> <419A49CE.7000506@ichips.intel.com> Message-ID: <52acthiqcq.fsf@topspin.com> Sean> I'd vote to remove the BUG_ONs, versus adding #ifdef. That seems fine to me. Maybe adding a comment in agent.c similar to what mad.c says ("Assumes ib_mad_port_list_lock is being held") is all we really need, something like this: Index: agent.c =================================================================== --- agent.c (revision 1249) +++ agent.c (working copy) @@ -36,6 +36,9 @@ extern kmem_cache_t *ib_mad_cache; +/* + * Caller must hold ib_agent_port_list_lock. + */ static inline struct ib_agent_port_private * __ib_get_agent_port(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) From halr at voltaire.com Tue Nov 16 11:03:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 14:03:57 -0500 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <20041116183019.GB1206@aon.at> References: <20041116183019.GB1206@aon.at> Message-ID: <1100631837.27971.22.camel@hpc-1> On Tue, 2004-11-16 at 13:30, Bernhard Fischer wrote: > Hi, > > from linux/spinlock.h: "spin_is_locked on UP always says FALSE" > > please consider applying, > > ______________________________________________________________________ Thanks for pointing this out. Guess we all were running SMP. The consensus seems to be to eliminate these rather than conditionalize them. -- Hal From halr at voltaire.com Tue Nov 16 11:04:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 14:04:07 -0500 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <52acthiqcq.fsf@topspin.com> References: <20041116183019.GB1206@aon.at> <52ekitiqq6.fsf@topspin.com> <419A49CE.7000506@ichips.intel.com> <52acthiqcq.fsf@topspin.com> Message-ID: <1100631847.27971.24.camel@hpc-1> On Tue, 2004-11-16 at 13:46, Roland Dreier wrote: > Sean> I'd vote to remove the BUG_ONs, versus adding #ifdef. 
> > That seems fine to me. Maybe adding a comment in agent.c similar to
> what mad.c says ("Assumes ib_mad_port_list_lock is being held") is all
> we really need, something like this:
>
> Index: agent.c
> ===================================================================
> --- agent.c (revision 1249)
> +++ agent.c (working copy)
> @@ -36,6 +36,9 @@
> extern kmem_cache_t *ib_mad_cache;
>
>
> +/*
> + * Caller must hold ib_agent_port_list_lock.
> + */
> static inline struct ib_agent_port_private *
> __ib_get_agent_port(struct ib_device *device, int port_num,
> struct ib_mad_agent *mad_agent)

Thanks. Applied.

-- Hal

From halr at voltaire.com Tue Nov 16 11:32:33 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 16 Nov 2004 14:32:33 -0500
Subject: [openib-general] Setting of MAD TID for user mode clients
Message-ID: <1100633553.27971.28.camel@hpc-1>

Hi,

Should it be the responsibility of user_mad or the client itself to set
the hi_tid ? Right now, it's in user_mad::ib_umad_write. Just wondering...

-- Hal

From roland at topspin.com Tue Nov 16 11:43:56 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 16 Nov 2004 11:43:56 -0800
Subject: [openib-general] Re: Setting of MAD TID for user mode clients
In-Reply-To: <1100633553.27971.28.camel@hpc-1> (Hal Rosenstock's message of "Tue, 16 Nov 2004 14:32:33 -0500")
References: <1100633553.27971.28.camel@hpc-1>
Message-ID: <526545inpf.fsf@topspin.com>

Hal> Hi, Should it be the responsibility of user_mad or the client
Hal> itself to set the hi_tid ? Right now, it's in
Hal> user_mad::ib_umad_write.

I think it has to be in the kernel (ie in user_mad.c) because we can't
trust anything userspace gives us.

 - R.

From mshefty at ichips.intel.com Tue Nov 16 11:49:34 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 16 Nov 2004 11:49:34 -0800
Subject: [openib-general] Re: Setting of MAD TID for user mode clients
In-Reply-To: <526545inpf.fsf@topspin.com>
References: <1100633553.27971.28.camel@hpc-1> <526545inpf.fsf@topspin.com>
Message-ID: <419A59CE.3000705@ichips.intel.com>

Roland Dreier wrote:
> Hal> Hi, Should it be the responsibility of user_mad or the client
> Hal> itself to set the hi_tid ? Right now, it's in
> Hal> user_mad::ib_umad_write.
>
> I think it has to be in the kernel (ie in user_mad.c) because we can't
> trust anything userspace gives us.

agreed

From mshefty at ichips.intel.com Tue Nov 16 17:07:10 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 16 Nov 2004 17:07:10 -0800
Subject: [openib-general] RMPP implementation
Message-ID: <419AA43E.4050405@ichips.intel.com>

I'm starting work on the RMPP implementation in the MAD code. If anyone
has any ideas/preferences on the implementation, please let me know.

For the send side, there are a couple of ways to perform the segmentation:

1. Issue one send at a time. Additional sends are not transferred until
   the first send completes.
2. Issue multiple sends using 2 data segments per request. This
   requires allocating and mapping space (36 bytes) for copying the MAD
   common and RMPP headers.
3. Issue multiple sends using 3 data segments per request. This is the
   same as #2, but only copies the RMPP header.

I'm leaning towards #2 at this point.

RMPP is fairly complex, so I will probably submit a series of patches,
rather than the entire implementation at once.
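[To make option 2 concrete: each RMPP segment would go out as one send work request with a two-entry gather list, the first entry covering the 36-byte bounce copy of the MAD common header (24 bytes) plus RMPP header (12 bytes), the second pointing into the payload at the segment's offset. A rough sketch against the gen2 verbs structures of this period, not code from Sean's series; rmpp_seg_ctx and rmpp_post_segment are invented names:

	struct rmpp_seg_ctx {
		u64 hdr_dma;	/* bounce buffer: copied MAD common + RMPP headers */
		u64 data_dma;	/* full mapped payload */
		u32 lkey;
		u32 seg_size;	/* payload bytes carried per segment */
	};

	static int rmpp_post_segment(struct ib_qp *qp, struct rmpp_seg_ctx *ctx,
				     u32 seg_num, u32 remaining)
	{
		struct ib_sge sge[2];
		struct ib_send_wr wr, *bad_wr;

		/* SGE 0: the 36-byte copy of the MAD common (24) and
		 * RMPP (12) headers, rewritten for each segment. */
		sge[0].addr   = ctx->hdr_dma;
		sge[0].length = 36;
		sge[0].lkey   = ctx->lkey;

		/* SGE 1: this segment's slice of the payload, used in place. */
		sge[1].addr   = ctx->data_dma + (u64) (seg_num - 1) * ctx->seg_size;
		sge[1].length = min_t(u32, remaining, ctx->seg_size);
		sge[1].lkey   = ctx->lkey;

		memset(&wr, 0, sizeof wr);
		wr.wr_id      = (unsigned long) ctx;
		wr.sg_list    = sge;
		wr.num_sge    = 2;
		wr.opcode     = IB_WR_SEND;
		wr.send_flags = IB_SEND_SIGNALED;
		/* UD address handle, remote QPN and Q_Key setup omitted. */

		return ib_post_send(qp, &wr, &bad_wr);
	}

Option 3 would simply split SGE 0 further so that only the 12-byte RMPP header is copied per segment.]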
- Sean

From ftillier at infiniconsys.com Tue Nov 16 17:33:43 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Tue, 16 Nov 2004 17:33:43 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <419AA43E.4050405@ichips.intel.com>
Message-ID: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Tuesday, November 16, 2004 5:07 PM
>
> I'm starting work on the RMPP implementation in the MAD code. If anyone
> has any ideas/preferences on the implementation, please let me know.
>
> For the send side, there are a couple of ways to perform the segmentation:
>
> 1. Issue one send at a time. Additional sends are not transferred until
> the first send completes.
> 2. Issue multiple sends using 2 data segments per request. This
> requires allocating and mapping space (36 bytes) for copying the MAD
> common and RMPP headers.
> 3. Issue multiple sends using 3 data segments per request. This is the
> same as #2, but only copies the RMPP header.
>
> I'm leaning towards #2 at this point.

Isn't #1 the simplest to implement? Turnaround on the send queue should be
pretty quick, so send performance should be fine. I say do whatever is
simplest, and then optimize from there, and to me that means #1 at the
moment. What are the reasons to *not* do #1?

- Fab

From mst at mellanox.co.il Wed Nov 17 01:08:35 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 17 Nov 2004 11:08:35 +0200
Subject: [openib-general] RMPP implementation
In-Reply-To: <419AA43E.4050405@ichips.intel.com>
References: <419AA43E.4050405@ichips.intel.com>
Message-ID: <20041117090835.GA6959@mellanox.co.il>

Hello!
Quoting r. Sean Hefty (mshefty at ichips.intel.com) "[openib-general] RMPP implementation":
> I'm starting work on the RMPP implementation in the MAD code. If anyone
> has any ideas/preferences on the implementation, please let me know.
>
> For the send side, there are a couple of ways to perform the segmentation:
>
> 1. Issue one send at a time. Additional sends are not transferred until
> the first send completes.
> 2. Issue multiple sends using 2 data segments per request. This
> requires allocating and mapping space (36 bytes) for copying the MAD
> common and RMPP headers.
> 3. Issue multiple sends using 3 data segments per request. This is the
> same as #2, but only copies the RMPP header.
>
> I'm leaning towards #2 at this point.
>
> RMPP is fairly complex, so I will probably submit a series of patches,
> rather than the entire implementation at once.
>
> - Sean

RMPP is somewhat similar to TCP.
I wonder if there is some way to re-use the TCP stack code.

MST

From roland at topspin.com Wed Nov 17 08:49:22 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 17 Nov 2004 08:49:22 -0800
Subject: [openib-general] ipoib_debugfs -- new kernel patch required for 2.6.9
Message-ID: <52lld0e7zh.fsf@topspin.com>

I just committed changes that replace the IPoIB /proc files with an
ipoib_debugfs filesystem. Using this filesystem is described in the
new docs/ipoib.txt file.

There is a new kernel patch, linux-2.6.9-backports.diff, that is
required to build against a 2.6.9 kernel. This patch just adds the
new d_alloc_name() function from 2.6.10-rc.

 - R.
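[On the hi_tid question settled earlier in the thread: the kernel owns the upper 32 bits of the 64-bit TID, so user_mad can stamp them with the kernel-assigned agent ID regardless of what userspace wrote. A sketch of the idea, not the actual user_mad.c code; the field names are approximate:

	static void umad_stamp_tid(struct ib_mad_agent *agent,
				   struct ib_mad_hdr *hdr)
	{
		u64 tid = be64_to_cpu(hdr->tid);

		/* Keep userspace's low 32 bits, but overwrite the high 32
		 * bits with the kernel-assigned agent ID so the response
		 * can be routed back without trusting userspace. */
		hdr->tid = cpu_to_be64(((u64) agent->hi_tid << 32) |
				       (tid & 0xffffffffULL));
	}
]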
From mshefty at ichips.intel.com Wed Nov 17 09:07:48 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 17 Nov 2004 09:07:48 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com>
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com>
Message-ID: <419B8564.2090805@ichips.intel.com>

Fab Tillier wrote:
>>1. Issue one send at a time. Additional sends are not transferred until
>> the first send completes.
>
> Isn't #1 the simplest to implement? Turnaround on the send queue should be
> pretty quick, so send performance should be fine. I say do whatever is
> simplest, and then optimize from there, and to me that means #1 at the
> moment. What are the reasons to *not* do #1?

It's simpler to implement, and would definitely be the easiest to do on
redirected QPs. The only disadvantage is that it lowers the throughput
between two clients. Also, this is a relatively small decrease in
complexity with respect to the rest of RMPP.

A couple of other areas that will need to be addressed include: RMPP
timeouts, receive window sizes, and user-mode support.

- Sean

From ftillier at infiniconsys.com Wed Nov 17 09:22:11 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Wed, 17 Nov 2004 09:22:11 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <419B8564.2090805@ichips.intel.com>
Message-ID: <000301c4ccc9$fa3a3cf0$655aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Wednesday, November 17, 2004 9:08 AM
>
> Fab Tillier wrote:
> >>1. Issue one send at a time. Additional sends are not transferred until
> >> the first send completes.
> >
> > Isn't #1 the simplest to implement? Turnaround on the send queue should
> > be
> > pretty quick, so send performance should be fine. I say do whatever is
> > simplest, and then optimize from there, and to me that means #1 at the
> > moment. What are the reasons to *not* do #1?
>
> It's simpler to implement, and would definitely be the easiest to do on
> redirected QPs. The only disadvantage is that it lowers the throughput
> between two clients. Also, this is a relatively small decrease in
> complexity with respect to the rest of RMPP.

I agree that it will lower the throughput, but by how much? I would expect
it to be minimal. It also allows more concurrent transfers to progress.
I'm thinking that the SA is likely the primary user of RMPP sends, and thus
responding to more queries in parallel is probably better than responding
to queries serially but faster for each query. The send completion delay
is likely to be less than the RMPP timeouts, so we might as well keep many
requestors going rather than get a response to any one client quickly.

- Fab

From roland at topspin.com Wed Nov 17 09:31:36 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 17 Nov 2004 09:31:36 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <419B8564.2090805@ichips.intel.com> (Sean Hefty's message of "Wed, 17 Nov 2004 09:07:48 -0800")
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com> <419B8564.2090805@ichips.intel.com>
Message-ID: <52hdnoe613.fsf@topspin.com>

Would it make sense to figure out what the expected consumers of this
RMPP support will be and what they will need before designing the RMPP
implementation?
 - Roland

From mshefty at ichips.intel.com Wed Nov 17 09:34:58 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 17 Nov 2004 09:34:58 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <52hdnoe613.fsf@topspin.com>
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com> <419B8564.2090805@ichips.intel.com> <52hdnoe613.fsf@topspin.com>
Message-ID: <419B8BC2.60806@ichips.intel.com>

Roland Dreier wrote:
> Would it make sense to figure out what the expected consumers of this
> RMPP support will be and what they will need before designing the RMPP
> implementation?

Absolutely. Right now, I'm assuming opensm and SA query as the primary
users.

- Sean

From halr at voltaire.com Wed Nov 17 09:42:04 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 17 Nov 2004 12:42:04 -0500
Subject: [openib-general] RMPP implementation
In-Reply-To: <419B8BC2.60806@ichips.intel.com>
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com> <419B8564.2090805@ichips.intel.com> <52hdnoe613.fsf@topspin.com> <419B8BC2.60806@ichips.intel.com>
Message-ID: <1100713324.12272.2.camel@localhost.localdomain>

On Wed, 2004-11-17 at 12:34, Sean Hefty wrote:
> Roland Dreier wrote:
>
> > Would it make sense to figure out what the expected consumers of this
> > RMPP support will be and what they will need before designing the RMPP
> > implementation?
>
> Absolutely. Right now, I'm assuming opensm and SA query as the primary
> users.

By OpenSM, I presume you are referring to the SA. It might also have
other applications: e.g. database synchronization between OpenSMs (but
this would be down the road). I think we also need to understand the
dynamics of the users as well.

-- Hal

From roland at topspin.com Wed Nov 17 14:57:41 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 17 Nov 2004 14:57:41 -0800
Subject: [openib-general] Updated backports patch needed to build
Message-ID: <52d5ycozh6.fsf@topspin.com>

I've checked in a few changes that use kernel features added after the
2.6.9 release. This means to build the latest tree, you have two choices:

 - Apply the latest linux-2.6.9-backports.diff to your 2.6.9 tree.
   This backports all the features required.

 - Use an extremely up-to-date kernel tree. I haven't tried 2.6.10-rc2
   but I expect it would work; certainly an up-to-date BK tree will
   definitely have everything needed. In this case you should apply
   all the patches _except_ linux-2.6.9-backports.diff.

 - Roland

From shaharf at voltaire.com Thu Nov 18 08:14:47 2004
From: shaharf at voltaire.com (shaharf)
Date: Thu, 18 Nov 2004 18:14:47 +0200
Subject: [openib-general] openib gen2 architecture
Message-ID:

Hi all,

I know I am new to this project and I must be naïve, but I want to
understand a few things concerning the openib architecture. In the course
of learning the openib gen2 stack and preparing to port the opensm to it
(which is my current task), I have encountered a few areas that seem
problematic to me, and I would like to understand the reasoning behind
them, if not to offer alternatives. I am sorry that I raise these issues
so late, but I was not involved in this project earlier. I hope it is
better late than never.

It seems to me that the major design approach is to do everything in the
kernel but let user-mode code access the lower levels so that
performance-sensitive applications can bypass all the kernel layers. Am I
right? It seems also that within the kernel, the ib interface/verbs (ib_*)
is very close to the mthca verbs, which are very close to vapi.
I know that this is the way most of the industry was working, but I
wonder - is this the correct model? Will this not pollute the kernel with
a lot of IB specific stuff?

Personally, I think that the IB verbs (vapi) are so complicated that
another level of abstraction is required. PDs, MRs, QPs, the QP state
machine, PKEYs, MLIDs and other "curses" - why should a module such as
IPoIB know about them? If the answer is performance then I have to
disagree. In the same fashion you could say that in order to achieve
efficient disk IO, applications should know the disk's geometry and be
able to do direct IO to the disk firmware, or that applications should
talk SCSI verbs to optimize their data transfers.

It seems to me that the current interfaces evolved to what they are today
mainly because of the way IB itself evolved - with a lot of uncertainty
and a lot of design holes (not to say "craters"). This forced most of the
industry to stick with very straightforward interfaces that were based on
Mellanox VAPI. I wonder if this is not the right time to come up with a
much better abstraction - for user mode and for kernel mode. For example,
it seems that the abstraction layer should abstract the IB networking
objects and not the IB hca interface. In other words - why not build the
abstraction around IB networking types - UD, RC, RD, MADs? Why do we have
to expose the memory management model of the driver/HCA to upper layers?
Do we really want to expose IB related structures such as CQs, QPs, and
WQEs? Why? Not only is this bad for abstraction, since changes in the
drivers will require upper-layer modifications, it is also very
problematic for security and stability reasons.

I think that using the correct abstraction is critical for real
acceptance in the Linux/open source world. A good abstraction will also
enable us to provide good and secure kernel-mode and user-mode interfaces
and access.

Once we have such interfaces, I think we should reconsider the
user/kernel division. As a general rule, I think it is commonly agreed
that the kernel should include only things that must be in the kernel,
meaning hardware-aware software and very performance-sensitive software.
Other software modules may be moved into the kernel once they are mature
and robust. For example, RPC, NFSD and SMBFS (SAMBA) were developed in
user mode, served many years in user mode, and then, after they had
matured, they started to "sink" into the kernel. I think that IB, and
especially the IB management modules, are far from being mature. Even the
IB standard itself is not really stable. Specifically, there is a
requirement (in the SOW) to make IB management distributed due to
scalability and other (redundancy, etc.) requirements. I do not know
whether this requirement will actually materialize, but if it does, the
SM and maybe also the SMI/GSI agents and the CM will have to change
significantly. If this is likely to happen, I would suggest keeping as
much as possible in user mode - it is much easier to develop and to
update. We should have kernel-based agents and mechanisms to assist
performance, but I think that most of the work should be done in user
mode, where it can do less harm. Specifically, things such as a MAD
transaction manager (retries, timeouts, matching), RMPP and others should
be developed in user mode and packaged as libraries, again, at least
until they stabilize and mature.
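[To make the proposal above concrete, here is a purely hypothetical sketch of an interface built around IB networking types rather than HCA objects; nothing like this exists in the tree, and every name below is invented:

	/* Hypothetical "networking-type" abstraction: the consumer deals
	 * in connections and messages, while PDs, MRs, CQs, QPs and the
	 * QP state machine stay hidden inside the midlayer. */
	struct ibnet_conn;			/* opaque connection handle */

	struct ibnet_conn *ibnet_rc_connect(struct ib_device *dev, u8 port,
					    union ib_gid *dgid);
	int  ibnet_send(struct ibnet_conn *conn, const void *buf, size_t len);
	int  ibnet_recv(struct ibnet_conn *conn, void *buf, size_t len);
	void ibnet_disconnect(struct ibnet_conn *conn);
]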
Why should we develop complicated functionality such as RMPP in the
kernel when only a few kernel-based queries (if any at all) will use it?

If I am not mistaken, one of the IB design goals was to enable efficient
*user mode* networking (copy-less, low latency). This is also the major
advantage IB has over several alternatives - most notably 10G Ethernet.
If we do not emphasize such advantages, we will reduce the survival
chances of IB once 10GE is widely used. If potential users get the
impression that, compared to 10GE, IB is cheaper, faster and more
efficient but requires tons of special kernel-based modules and very
complicated interfaces, and is therefore much less stable and much more
exposed to bugs, they will use 10GE. I have no doubt. Yes, it is true
that this project is meant to supply an HPC code base, but eventually, IB
will not survive as an HPC interconnect only. Furthermore, all HPC
applications are user-mode based. Good user-mode interfaces are no less
critical for HPC than for any other high-end networking applications.

I really would like to know whether I am shooting in the dark or whether
the issues I mentioned have been discussed and there are good reasons to
do things the way they are. Or maybe I don't get the picture and the
state of things is completely different from what I am painting. Either
way, I would like to know what you think.

Thanks,

Shahar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From roland at topspin.com Thu Nov 18 08:28:43 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 08:28:43 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To: (shaharf@voltaire.com's message of "Thu, 18 Nov 2004 18:14:47 +0200")
References:
Message-ID: <528y8zm890.fsf@topspin.com>

    shaharf> It seems to me that the major design approach is to do
    shaharf> everything in the kernel but let user-mode code access
    shaharf> the lower levels so that performance-sensitive
    shaharf> applications can bypass all the kernel layers. Am I right?

No. The reason everything is in the kernel now is that we simply have
not started implementing any userspace verbs support.

    shaharf> I wonder if this is not the right time to come up with
    shaharf> a much better abstraction - for user mode and for kernel
    shaharf> mode. For example, it seems that the abstraction layer
    shaharf> should abstract the IB networking objects and not the IB
    shaharf> hca interface. In other words - why not build the
    shaharf> abstraction around IB networking types - UD, RC, RD,
    shaharf> MADs? Why do we have to expose the memory management
    shaharf> model of the driver/HCA to upper layers? Do we really
    shaharf> want to expose IB related structures such as CQs, QPs,
    shaharf> and WQEs? Why? Not only is this bad for abstraction,
    shaharf> since changes in the drivers will require upper-layer
    shaharf> modifications, it is also very problematic for security
    shaharf> and stability reasons.

Keep in mind that CQs, QPs and other IB transport objects are
themselves abstractions. I'm not opposed to better abstractions in
principle, but I think that the current level is a good one. Any IB
hardware is likely to be optimized for implementing these
abstractions, and I have a hard time believing we are smart enough to
build layers on top of them that are both generic enough and
efficient enough for all applications.

    shaharf> I think that using the correct abstraction is critical
    shaharf> for real acceptance in the Linux/open source world.

I agree.
However, I think the Linux kernel community will actually be opposed to
extra abstraction layers that hide what the hardware is really doing.
For example, I believe any really high-performance IB application is
going to understand the work queueing model and want to deal with QPs
and CQs.

    shaharf> Why should we develop complicated functionality such as
    shaharf> RMPP in the kernel when only a few kernel-based queries
    shaharf> (if any at all) will use it?

Funny you should raise this point. I said the same thing some time ago
and Yaron Haviv violently disagreed ;)

 - Roland

From halr at voltaire.com Thu Nov 18 08:42:16 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 18 Nov 2004 11:42:16 -0500
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <52r7n37xz9.fsf@topspin.com>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com>
Message-ID: <1100796136.3277.9.camel@localhost.localdomain>

On Tue, 2004-11-09 at 12:07, Roland Dreier wrote:
> multiport bonding/failover
> (although my feeling is that it would be better to extend the existing
> bonding driver rather than trying to put this in the IPoIB driver), ....

I'm not clear what the tradeoffs / pros / cons of the two approaches
(use the bonding driver (above the IPoIB driver) or implement it inside
the IPoIB driver) would be.

-- Hal

From halr at voltaire.com Thu Nov 18 09:02:03 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 18 Nov 2004 12:02:03 -0500
Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names
Message-ID: <1100797323.3277.19.camel@localhost.localdomain>

mad: Add port number to MAD thread names

Index: mad.c
===================================================================
--- mad.c (revision 1259)
+++ mad.c (working copy)
@@ -1843,6 +1843,7 @@
 	int ret, cq_size;
 	struct ib_mad_port_private *port_priv;
 	unsigned long flags;
+	char name[8];

 	/* First, check if port already open at MAD layer */
 	port_priv = ib_get_mad_port(device, port_num);
@@ -1898,7 +1899,8 @@
 	if (ret)
 		goto error7;

-	port_priv->wq = create_workqueue("ib_mad");
+	sprintf(name, "ib_mad%d", port_num);
+	port_priv->wq = create_workqueue(name);
 	if (!port_priv->wq) {
 		ret = -ENOMEM;
 		goto error8;

From roland at topspin.com Thu Nov 18 09:23:47 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 09:23:47 -0800
Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names
In-Reply-To: <1100797323.3277.19.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 12:02:03 -0500")
References: <1100797323.3277.19.camel@localhost.localdomain>
Message-ID: <52zn1fkr4s.fsf@topspin.com>

It's extremely unlikely, but:

    + char name[8];

    + sprintf(name, "ib_mad%d", port_num);

if port_num >= 10, this will overflow the buffer. Since a device
could conceivably have up to 255 ports (although an HCA with hundreds
of ports is rather far-fetched, and we only create one port for a
switch), I would suggest doing

    char name[sizeof "ib_mad123"];

and

    snprintf(name, sizeof name, "ib_mad%d", port_num);

for correctness and (mostly) ease of auditing.
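[For reference, sizeof applied to a string literal counts the terminating NUL, so char name[sizeof "ib_mad123"] is 10 bytes and already holds the worst case "ib_mad255". A standalone userspace check; the test harness itself is illustrative:

	#include <assert.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char name[sizeof "ib_mad123"];	/* 10 bytes: 9 chars + NUL */

		/* 255 is the largest possible port number, so "ib_mad255"
		 * is the longest name; snprintf never writes past the
		 * buffer it is given. */
		int n = snprintf(name, sizeof name, "ib_mad%d", 255);

		assert(n == (int) strlen("ib_mad255"));
		printf("%s (%zu bytes used of %zu)\n", name,
		       strlen(name) + 1, sizeof name);
		return 0;
	}

It also means the + 1 suggested later in this thread is harmless extra padding rather than a required fix.]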
 - Roland

From Nitin.Hande at Sun.COM Thu Nov 18 09:34:32 2004
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Thu, 18 Nov 2004 09:34:32 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <1100796136.3277.9.camel@localhost.localdomain>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain>
Message-ID: <419CDD28.70906@Sun.COM>

Hal/Roland,
Hal Rosenstock wrote:
> On Tue, 2004-11-09 at 12:07, Roland Dreier wrote:
>
>>multiport bonding/failover
>>(although my feeling is that it would be better to extend the existing
>>bonding driver rather than trying to put this in the IPoIB driver), ....
>
> I'm not clear what the tradeoffs / pros / cons of the two approaches
> (use the bonding driver (above the IPoIB driver) or implement it inside
> the IPoIB driver) would be.

I just started taking a look at the existing bonding driver and
evaluating what work needs to be done to support the ipoib driver below
it. It seems to me that a lot of the pieces for this approach are readily
available (ifenslave and other logic), and besides, I guess that will
keep one standard approach to doing bonding in Linux. I also assume that
while the ipoib gets enslaved, it will get enough opportunity to take
the right set of steps for its present connections and traffic etc. I
might be wrong here though. Would like to hear from other members about
this one.

Thanks
Nitin

> -- Hal

From iod00d at hp.com Thu Nov 18 09:41:51 2004
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 18 Nov 2004 09:41:51 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To:
References:
Message-ID: <20041118174151.GB14868@esmail.cup.hp.com>

On Thu, Nov 18, 2004 at 06:14:47PM +0200, shaharf wrote:
> Personally, I think that the IB verbs (vapi) are so complicated that
> another level of abstraction is required. PDs, MRs, QPs, the QP state
> machine, PKEYs, MLIDs and other "curses" - why should a module such as
> IPoIB know about them?
> If the answer is performance then I have to disagree. In the same fashion
> you could say that in order to achieve efficient disk IO, applications
> should know the disk's geometry and be able to do direct IO to the disk
> firmware, or that applications should talk SCSI verbs to optimize their
> data transfers.

Some applications in fact still do this. (e.g. sgp_dd in sg3-utils
package) But other applications trade off some of the performance for
manageability (abstract storage) and portability to other storage
technologies.

IPoIB should know whatever it needs about IB to get good performance.
It doesn't need to be portable, and layers above (ifconfig) and below
(SA) should be providing manageability. (At least assuming I understand
this correctly)

> I wonder if this is not the right time to come up with a much better
> abstraction - for user mode and for kernel mode. For example, it
> seems that the abstraction layer should abstract the IB networking
> objects and not the IB hca interface. In other words - why not build
> the abstraction around IB networking types - UD, RC, RD, MADs?

If you think it will perform as well as others expect IPoIB should,
then go for it.
People argued replacing the TCP/IP stack (highly tuned) with something
else is a mistake since one also loses all the features (packet
filtering, notably). Beware you aren't the first to present a similar
argument.

I don't really care as long as it works.

thanks,
grant

From Nitin.Hande at Sun.COM Thu Nov 18 09:44:11 2004
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Thu, 18 Nov 2004 09:44:11 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <419CDD28.70906@Sun.COM>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM>
Message-ID: <419CDF6B.3080102@Sun.COM>

Nitin Hande wrote:
> Hal/Roland,
> Hal Rosenstock wrote:
>
>>On Tue, 2004-11-09 at 12:07, Roland Dreier wrote:
>>
>>>multiport bonding/failover
>>>(although my feeling is that it would be better to extend the existing
>>>bonding driver rather than trying to put this in the IPoIB driver), ....
>>
>>I'm not clear what the tradeoffs / pros / cons of the two approaches
>>(use the bonding driver (above the IPoIB driver) or implement it inside
>>the IPoIB driver) would be.
>
> I just started taking a look at the existing bonding driver and
> evaluating what work needs to be done to support the ipoib driver below
> it. It seems to me that a lot of the pieces for this approach are
> readily available (ifenslave and other logic), and besides, I guess
> that will keep one standard approach to doing bonding in Linux. I also
> assume that while the ipoib gets enslaved, it will get enough
> opportunity to take the right set of steps for its present connections
> and traffic etc.

Oh well, looking at the code, the first thing ifenslave does is bring
the slave interface down, thereby forcing ipoib to flush its traffic.

Thanks
Nitin

> I might be wrong here though. Would like to hear from other members
> about this one.
>
> Thanks
> Nitin

From peter at pantasys.com Thu Nov 18 10:18:49 2004
From: peter at pantasys.com (Peter Buckingham)
Date: Thu, 18 Nov 2004 10:18:49 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <419CDD28.70906@Sun.COM>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM>
Message-ID: <419CE789.7080602@pantasys.com>

Nitin Hande wrote:
> I just started taking a look at the existing bonding driver and
> evaluating what work needs to be done to support the ipoib driver below
> it. It seems to me that a lot of the pieces for this approach are
> readily available (ifenslave and other logic), and besides, I guess
> that will keep one standard approach to doing bonding in Linux. I also
> assume that while the ipoib gets enslaved, it will get enough
> opportunity to take the right set of steps for its present connections
> and traffic etc. I might be wrong here though. Would like to hear from
> other members about this one.
i took a quick look at this a little while ago (admittedly with gen1
IPoIB) and when trying to enslave the ib0, ib1 interfaces the bonding
driver complains about an unsupported ioctl. i didn't have any time to
track it down much further than that. if there's some interest i could
take a bit more of a look at it.

peter

From krause at cup.hp.com Thu Nov 18 10:25:43 2004
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 18 Nov 2004 10:25:43 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To: <20041118174151.GB14868@esmail.cup.hp.com>
References: <20041118174151.GB14868@esmail.cup.hp.com>
Message-ID: <6.1.2.0.2.20041118102226.01de9890@esmail.cup.hp.com>

At 09:41 AM 11/18/2004, Grant Grundler wrote:
>On Thu, Nov 18, 2004 at 06:14:47PM +0200, shaharf wrote:
> > Personally, I think that the IB verbs (vapi) are so complicated that
> > another level of abstraction is required. PDs, MRs, QPs, the QP state
> > machine, PKEYs, MLIDs and other "curses" - why should a module such as
> > IPoIB know about them?
> > If the answer is performance then I have to disagree. In the same fashion
> > you could say that in order to achieve efficient disk IO, applications
> > should know the disk's geometry and be able to do direct IO to the disk
> > firmware, or that applications should talk SCSI verbs to optimize their
> > data transfers.

In general, there is very little that IP over IB must know to operate.
It really comes down to the design implementation and the choices people
want to make. Given this is an open source project, one might suggest
that if there is a better way to structure the design, then implement
and propose it as a replacement.

> > I wonder if this is not the right time to come up with a much better
> > abstraction - for user mode and for kernel mode. For example, it
> > seems that the abstraction layer should abstract the IB networking
> > objects and not the IB hca interface. In other words - why not build
> > the abstraction around IB networking types - UD, RC, RD, MADs?

Some designs are modular in nature and keep consumers such as IP over IB
from having to know much more than the QP / work queue / completion queue
to operate. It again comes down to design as some focus on maximum code
re-use. For example, there is nothing that precludes an implementation
from using a kernel IT API / DAPL interface for most subsystems in order
to free itself from all of these details.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mshefty at ichips.intel.com Thu Nov 18 10:29:58 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 18 Nov 2004 10:29:58 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To:
References:
Message-ID: <419CEA26.3030702@ichips.intel.com>

shaharf wrote:
> It seems to me that the major design approach is to do everything in
> the kernel but let user-mode code access the lower levels so that
> performance-sensitive applications can bypass all the kernel layers.
> Am I right?

The focus is only on kernel components at the moment, plus whatever
user-mode support is needed to configure the fabric.

> It seems also that within the kernel, the ib interface/verbs (ib_*)
> is very close to the mthca verbs, which are very close to vapi.

The agreement among the IB vendors was to start with VAPI as the base
for the development of a new API. But VAPI is little more than verbs as
defined by the IB spec. InfiniBand hardware exposes PDs, CQs, QPs, etc.,
so I think it's natural for these constructs to appear in the software.
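[The de-multiplexing Sean describes next is little more than looking up the right low-level driver for each registered device, in the same style as the __ib_get_mad_port() helper quoted earlier in this thread. A generic sketch, not the actual core/device.c; the list, lock and field names are illustrative:

	static LIST_HEAD(device_list);
	static spinlock_t device_lock = SPIN_LOCK_UNLOCKED;

	static struct ib_device *find_device(const char *name)
	{
		struct ib_device *dev, *found = NULL;

		/* Walk the registered devices; each entry was added by a
		 * low-level driver (e.g. ib_mthca) at registration time. */
		spin_lock(&device_lock);
		list_for_each_entry(dev, &device_list, core_list)
			if (!strcmp(dev->name, name)) {
				found = dev;
				break;
			}
		spin_unlock(&device_lock);
		return found;
	}
]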
Abstractions away from IB specific constructs do exist in the form of
SDP, ipoib, and DAPL.

> It seems to me that the current interfaces evolved to what they are
> today mainly because of the way IB itself evolved - with a lot of
> uncertainty and a lot of design holes (not to say "craters"). This
> forced most of the industry to stick with very straightforward
> interfaces that were based on Mellanox VAPI.

The verbs interface evolved because IB hardware is expected to expose
this sort of functionality. If you examine the layering of the
software, ib_mthca and ib_core work together, with ib_core simply
de-multiplexing among multiple devices.

> I wonder if this is not the right time to come up with a much better
> abstraction - for user mode and for kernel mode.

I believe that we have the correct software layering. At the lowest
level you need software that talks directly with the hardware, with
support for different hardware devices. Abstractions, such as SDP and
DAPL, should be above that. It seems that you're just wanting to move
the abstraction down lower in the stack.

> Do we really want to expose IB related
> structures such as CQs, QPs, and WQEs? Why?

*blinks*

> etc.) requirements. I do not know whether this requirement will actually
> materialize, but if it does, the SM and maybe also the SMI/GSI agents and
> the CM will have to change significantly. If this is likely to

Coding to requirements that may or may not happen, or to potential
future changes, is likely to produce nothing usable.

> Why should we develop complicated functionality such as
> RMPP in the kernel when only a few kernel-based queries (if any at
> all) will use it?

I'm not opposed to moving functionality from the kernel to user-space,
if it makes sense to do so. Note that TCP is in the kernel, and RMPP is
somewhat similar.

> very complicated interfaces, and is therefore much less
> stable and much more exposed to bugs, they will use 10GE.

Regardless of exposed interfaces, there will still be a need to
implement IB management, meaning that the complexity will still be
there.

> doubt. Yes, it is true that this project is meant to supply an HPC code
> base, but eventually, IB will not survive as an HPC interconnect only.

This is a debatable point. I think that IB can survive as only an HPC
interconnect, and that it may have to. Ethernet may never be as good as
IB, but that doesn't mean that it won't someday be good enough,
especially if it comes for "free" on the motherboard.

From Nitin.Hande at Sun.COM Thu Nov 18 10:41:55 2004
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Thu, 18 Nov 2004 10:41:55 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <419CE789.7080602@pantasys.com>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM> <419CE789.7080602@pantasys.com>
Message-ID: <419CECF3.1030606@Sun.COM>

Peter Buckingham wrote:
> Nitin Hande wrote:
>
>>I just started taking a look at the existing bonding driver and
>>evaluating what work needs to be done to support the ipoib driver below
>>it. It seems to me that a lot of the pieces for this approach are
>>readily available (ifenslave and other logic), and besides, I guess
>>that will keep one standard approach to doing bonding in Linux. I also
>>assume that while the ipoib gets enslaved, it will get enough
>>opportunity to take the right set of steps for its present connections
>>and traffic etc. I might be wrong here though. Would like to hear from
>>other members about this one.
> > > i took a quick look at this a little while ago (admittedly with gen1 > IPoIB) and when trying to enslave the ib0, ib1 interfaces the bonding > driver complains about an unsupported ioctl. i didn't have any time to > track it down much further than that. if there's some interest i could > take a bit more of a look at it. Yes, some of that needs to be implemented for ipoib. I have some bits being readied. But before that would like to hear from people about various approaches. Thanks Nitin > > peter > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Nov 18 10:43:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:43:32 -0500 Subject: [openib-general] Re: More on IPoIB Multicast In-Reply-To: <419CECF3.1030606@Sun.COM> References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM> <419CE789.7080602@pantasys.com> <419CECF3.1030606@Sun.COM> Message-ID: <1100803412.3280.4.camel@localhost.localdomain> On Thu, 2004-11-18 at 13:41, Nitin Hande wrote: > But before that would like to hear from people about > various approaches. Some vendors have implemented this by combining multiple HCA ports and failing over from one to the other. Bonding may provide striping (using both ports concurrently). I will need to read up on bonding to understand what it provides and compare it to what can be done under the IPoIB driver. -- Hal From halr at voltaire.com Thu Nov 18 10:45:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:45:50 -0500 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52zn1fkr4s.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> Message-ID: <1100803550.3280.7.camel@localhost.localdomain> On Thu, 2004-11-18 at 12:23, Roland Dreier wrote: > It's extremely unlikely, but: > > + char name[8]; > > + sprintf(name, "ib_mad%d", port_num); > > if port_num >= 10, this will overflow the buffer. Since a device > could conceivably have up to 255 ports (although an HCA with hundreds > of ports is rather far-fetched, and we only create one port for a > switch), I would suggest doing > > char name[sizeof "ib_mad123"]; > > and > > snprintf(name, sizeof name, "ib_mad%d", port_num); > > for correctness and (mostly) ease of auditing. Thanks. Applied. 
-- Hal Index: mad.c =================================================================== --- mad.c (revision 1261) +++ mad.c (working copy) @@ -1843,7 +1843,7 @@ int ret, cq_size; struct ib_mad_port_private *port_priv; unsigned long flags; - char name[8]; + char name[sizeof "ib_mad123"]; /* First, check if port already open at MAD layer */ port_priv = ib_get_mad_port(device, port_num); @@ -1899,7 +1899,7 @@ if (ret) goto error7; - sprintf(name, "ib_mad%d", port_num); + snprintf(name, sizeof name, "ib_mad%d", port_num); port_priv->wq = create_workqueue(name); if (!port_priv->wq) { ret = -ENOMEM; From johannes at erdfelt.com Thu Nov 18 10:53:05 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 18 Nov 2004 10:53:05 -0800 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52zn1fkr4s.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> Message-ID: <20041118185305.GQ27658@sventech.com> On Thu, Nov 18, 2004, Roland Dreier wrote: > It's extremely unlikely, but: > > + char name[8]; > > + sprintf(name, "ib_mad%d", port_num); > > if port_num >= 10, this will overflow the buffer. Since a device > could conceivably have up to 255 ports (although an HCA with hundreds > of ports is rather far-fetched, and we only create one port for a > switch), I would suggest doing > > char name[sizeof "ib_mad123"]; You mean char name[sizeof "ib_mad123" + 1]; right? :) Otherwise we'll limit the name to < 100 ports (yes, yes, nitpicking) > and > > snprintf(name, sizeof name, "ib_mad%d", port_num); > > for correctness and (mostly) ease of auditing. I agree completely. JE From halr at voltaire.com Thu Nov 18 10:49:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:49:04 -0500 Subject: [openib-general] mthca crash on startup Message-ID: <1100803744.3280.11.camel@localhost.localdomain> When starting ib_mthca, I got the following log messages. I am running with the latest bits. It may also have been related to the startup of a switch or SM at the same instant in time. -- Hal Nov 18 13:32:06 localhost kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) Nov 18 13:32:06 localhost kernel: ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:02:00.0) Nov 18 13:32:07 localhost /sbin/hotplug: no runnable /etc/hotplug/infiniband.agent is installed Nov 18 13:32:07 localhost kernel: modprobe: page allocation failure. 
order:6, mode:0x20 Nov 18 13:32:07 localhost kernel: [] __alloc_pages+0x1c2/0x370 Nov 18 13:32:07 localhost kernel: [] __get_free_pages+0x1f/0x40 Nov 18 13:32:07 localhost kernel: [] dma_alloc_coherent+0xce/0x100 Nov 18 13:32:07 localhost kernel: [] mthca_cmd_box+0x85/0xe0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_alloc_sqp+0x6c/0x420 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_create_qp+0x16e/0x180 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] ib_create_qp+0x22/0x80 [ib_core] Nov 18 13:32:07 localhost kernel: [] create_mad_qp+0x86/0xd0 [ib_mad] Nov 18 13:32:07 localhost kernel: [] qp_event_handler+0x0/0x30 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_get_dma_mr+0x1e/0x50 [ib_core] Nov 18 13:32:07 localhost kernel: [] ib_mad_port_open+0x233/0x5c0 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_mad_init_device+0x3e/0x100 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_cache_setup_one+0x12d/0x1d0 [ib_core] Nov 18 13:32:07 localhost /sbin/hotplug: no runnable /etc/hotplug/infiniband.agent is installed Nov 18 13:32:07 localhost kernel: [] ib_mad_init_device+0x0/0x100 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_register_device+0x17d/0x1a0 [ib_core] Nov 18 13:32:07 localhost kernel: [] mthca_req_notify_cq+0x0/0x30 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_poll_cq+0x0/0xbb0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_destroy_cq+0x0/0x30 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_register_device+0x15b/0x1a0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_init_one+0x523/0x6e0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] pci_device_probe_static+0x52/0x70 Nov 18 13:32:08 localhost kernel: [] __pci_device_probe+0x3c/0x50 Nov 18 13:32:08 localhost kernel: [] pci_device_probe+0x2c/0x50 Nov 18 13:32:08 localhost kernel: [] bus_match+0x3f/0x70 Nov 18 13:32:08 localhost kernel: [] driver_attach+0x5c/0x90 Nov 18 13:32:08 localhost kernel: [] bus_add_driver+0x91/0xb0 Nov 18 13:32:08 localhost kernel: [] driver_register+0x8c/0x90 Nov 18 13:32:08 localhost kernel: [] pci_register_driver+0x90/0xb0 Nov 18 13:32:08 localhost kernel: [] mthca_init+0xf/0x1a [ib_mthca] Nov 18 13:32:08 localhost kernel: [] sys_init_module+0x289/0x340 Nov 18 13:32:08 localhost kernel: [] sysenter_past_esp+0x52/0x71 From halr at voltaire.com Thu Nov 18 10:51:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:51:33 -0500 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <20041118185305.GQ27658@sventech.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> Message-ID: <1100803892.3280.13.camel@localhost.localdomain> On Thu, 2004-11-18 at 13:53, Johannes Erdfelt wrote: > I would suggest doing > > > > char name[sizeof "ib_mad123"]; > > You mean > > char name[sizeof "ib_mad123" + 1]; > > right? :) Right. Thanks. Applied. -- Hal From roland at topspin.com Thu Nov 18 10:57:09 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:57:09 -0800 Subject: [openib-general] Draft kernel RFC patches coming... Message-ID: <52llczkmt6.fsf@topspin.com> I'm about to send out a draft version of the kernel submission patches. 
I am using the same script I'll use to send the patches to
linux-kernel, so look for the thread starting

    [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review

All comments/corrections/criticisms, both about the code and the
introduction and patch descriptions I wrote, will be very much
appreciated.

Assuming things look OK, I'm still planning on sending this to
linux-kernel on Monday.

Thanks,
  Roland

From roland at topspin.com Thu Nov 18 10:57:33 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:57:33 -0800
Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review
Message-ID: <200411181057.Qj4goy9DRJmMAYGe@topspin.com>

I'm very happy to be able to post an initial version of InfiniBand
patches for review. Although this code should be far closer to kernel
coding standards than previous open source InfiniBand drivers, this
initial posting should be treated as a request for comments and not a
request for inclusion; our ultimate goal is to have these drivers
included in the mainline kernel, but we expect that fixes and
improvements will need to be made before the code is completely
acceptable.

These patches add a minimal but complete level of InfiniBand support,
including an IB midlayer, a low-level driver for Mellanox HCAs, an
IP-over-InfiniBand driver, and a mechanism for MADs (management
datagrams) to be passed to and from userspace. This means that these
patches are all that is required for the kernel to bring up and use an
IP-over-InfiniBand link.

The code has not been through extreme stress testing yet, but it has
been used successfully on i386, x86_64, ppc64, ia64 and sparc64
systems, including mixed 32/64 systems.

Feedback on both details of the code as well as the high-level
organization of the code will be very much appreciated. For example,
the current set of patches puts include files in
drivers/infiniband/include; would it be preferred to put include files
in include/linux/infiniband/, directly in include/linux, or perhaps in
include/infiniband?

We would also like to explore the best avenue for having these patches
merged. It may be desirable for the patches to spend some time in -mm
before moving into Linus's kernel; on the other hand, the patches make
only very minimal and safe changes outside of drivers/infiniband, so it
is quite reasonable to merge them directly into the mainline kernel.
Although 2.6.10 is now closed, 2.6.11 should be open by the time the
review process is complete.

We look forward to the community's comments and criticisms!

Thanks,
  Roland Dreier
  OpenIB Alliance
From roland at topspin.com Thu Nov 18 10:58:03 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:03 -0800
Subject: [openib-general] [PATCH][RFC/v1][1/12] Add core InfiniBand support
Message-ID: <200411181058.nZu5AGvCLwleEqeJ@topspin.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: Roland Dreier
Subject: [PATCH][RFC/v1][1/12] Add core InfiniBand support
Date: Thu, 18 Nov 2004 10:58:03 -0800
Size: 120267
URL:

From roland at topspin.com Thu Nov 18 10:58:10 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:10 -0800
Subject: [openib-general] [PATCH][RFC/v1][2/12] Hook up drivers/infiniband
In-Reply-To: <200411181058.nZu5AGvCLwleEqeJ@topspin.com>
Message-ID: <200411181058.K6SRbLv9kMx8dY1X@topspin.com>

Add the appropriate lines to drivers/Kconfig and drivers/Makefile so
that the kernel configuration and build systems know about
drivers/infiniband.
Signed-off-by: Roland Dreier

Index: linux-bk/drivers/Kconfig
===================================================================
--- linux-bk.orig/drivers/Kconfig	2004-11-17 19:52:35.000000000 -0800
+++ linux-bk/drivers/Kconfig	2004-11-18 10:51:38.887317830 -0800
@@ -54,4 +54,6 @@
 source "drivers/usb/Kconfig"
+source "drivers/infiniband/Kconfig"
+
 endmenu

Index: linux-bk/drivers/Makefile
===================================================================
--- linux-bk.orig/drivers/Makefile	2004-11-17 19:52:44.000000000 -0800
+++ linux-bk/drivers/Makefile	2004-11-18 10:51:38.887317830 -0800
@@ -59,4 +59,5 @@
 obj-$(CONFIG_EISA) += eisa/
 obj-$(CONFIG_CPU_FREQ) += cpufreq/
 obj-$(CONFIG_MMC) += mmc/
+obj-$(CONFIG_INFINIBAND) += infiniband/
 obj-y += firmware/

From roland at topspin.com  Thu Nov 18 10:58:15 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:15 -0800
Subject: [openib-general] [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support
Message-ID: <200411181058.BeTpz4xPzTYV7Nk7@topspin.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: Roland Dreier
Subject: [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support
Date: Thu, 18 Nov 2004 10:58:15 -0800
Size: 108305
URL:

From roland at topspin.com  Thu Nov 18 10:58:22 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:22 -0800
Subject: [openib-general] [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support
Message-ID: <200411181058.sHj94LsTlhUWv3cp@topspin.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: Roland Dreier
Subject: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support
Date: Thu, 18 Nov 2004 10:58:22 -0800
Size: 32660
URL:

From roland at topspin.com  Thu Nov 18 10:58:27 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:27 -0800
Subject: [openib-general] [PATCH][RFC/v1][5/12] Add Mellanox HCA low-level driver
In-Reply-To: <200411181058.sHj94LsTlhUWv3cp@topspin.com>
Message-ID: <200411181058.1FCya6SB4aFjW3YZ@topspin.com>

Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present.

(As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.)

Signed-off-by: Roland Dreier

Index: linux-bk/drivers/infiniband/Kconfig
===================================================================
--- linux-bk.orig/drivers/infiniband/Kconfig	2004-11-18 10:51:37.708491106 -0800
+++ linux-bk/drivers/infiniband/Kconfig	2004-11-18 10:51:40.509079447 -0800
@@ -8,4 +8,6 @@
 	  any protocols you wish to use as well as drivers for your
 	  InfiniBand hardware.
+source "drivers/infiniband/hw/Kconfig"
+
 endmenu

Index: linux-bk/drivers/infiniband/Makefile
===================================================================
--- linux-bk.orig/drivers/infiniband/Makefile	2004-11-18 10:51:37.740486403 -0800
+++ linux-bk/drivers/infiniband/Makefile	2004-11-18 10:51:40.483083269 -0800
@@ -1 +1 @@
-obj-$(CONFIG_INFINIBAND) += core/
+obj-$(CONFIG_INFINIBAND) += core/ hw/

Index: linux-bk/drivers/infiniband/hw/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/Kconfig	2004-11-18 10:51:40.535075626 -0800
@@ -0,0 +1 @@
+source "drivers/infiniband/hw/mthca/Kconfig"

Index: linux-bk/drivers/infiniband/hw/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/Makefile	2004-11-18 10:51:40.559072099 -0800
@@ -0,0 +1 @@
+obj-$(CONFIG_INFINIBAND_MTHCA) += mthca/

Index: linux-bk/drivers/infiniband/hw/mthca/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/Kconfig	2004-11-18 10:51:40.583068572 -0800
@@ -0,0 +1,26 @@
+config INFINIBAND_MTHCA
+	tristate "Mellanox HCA support"
+	depends on PCI && INFINIBAND
+	---help---
+	  This is a low-level driver for Mellanox InfiniHost host
+	  channel adapters (HCAs), including the MT23108 PCI-X HCA
+	  ("Tavor") and the MT25208 PCI Express HCA ("Arbel").
+
+config INFINIBAND_MTHCA_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_MTHCA
+	default n
+	---help---
+	  This option causes the mthca driver to produce a bunch of
+	  debug messages.  Select this if you are developing the
+	  driver or trying to diagnose a problem.
+
+config INFINIBAND_MTHCA_SSE_DOORBELL
+	bool "SSE doorbell code"
+	depends on INFINIBAND_MTHCA && X86 && !X86_64
+	default n
+	---help---
+	  This option will have the mthca driver use SSE instructions
+	  to ring hardware doorbell registers.  This may improve
+	  performance for some workloads, but the driver will not run
+	  on processors without SSE instructions.

Index: linux-bk/drivers/infiniband/hw/mthca/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/Makefile	2004-11-18 10:51:40.606065191 -0800
@@ -0,0 +1,23 @@
+EXTRA_CFLAGS += -Idrivers/infiniband/include
+
+ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
+EXTRA_CFLAGS += -DDEBUG
+endif
+
+obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o
+
+ib_mthca-objs := \
+	mthca_main.o \
+	mthca_cmd.o \
+	mthca_profile.o \
+	mthca_reset.o \
+	mthca_allocator.o \
+	mthca_eq.o \
+	mthca_pd.o \
+	mthca_cq.o \
+	mthca_mr.o \
+	mthca_qp.o \
+	mthca_av.o \
+	mthca_mcg.o \
+	mthca_mad.o \
+	mthca_provider.o

Index: linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c	2004-11-18 10:51:40.630061664 -0800
@@ -0,0 +1,175 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $
+ */
+
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/bitmap.h>
+
+#include "mthca_dev.h"
+
+/* Trivial bitmap-based allocator */
+u32 mthca_alloc(struct mthca_alloc *alloc)
+{
+	u32 obj;
+
+	spin_lock(&alloc->lock);
+	obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last);
+	if (obj >= alloc->max) {
+		alloc->top = (alloc->top + alloc->max) & alloc->mask;
+		obj = find_first_zero_bit(alloc->table, alloc->max);
+	}
+
+	if (obj < alloc->max) {
+		set_bit(obj, alloc->table);
+		obj |= alloc->top;
+	} else
+		obj = -1;
+
+	spin_unlock(&alloc->lock);
+
+	return obj;
+}
+
+void mthca_free(struct mthca_alloc *alloc, u32 obj)
+{
+	obj &= alloc->max - 1;
+	spin_lock(&alloc->lock);
+	clear_bit(obj, alloc->table);
+	alloc->last = min(alloc->last, obj);
+	alloc->top = (alloc->top + alloc->max) & alloc->mask;
+	spin_unlock(&alloc->lock);
+}
+
+int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask,
+		     u32 reserved)
+{
+	int i;
+
+	/* num must be a power of 2 */
+	if (num != 1 << (ffs(num) - 1))
+		return -EINVAL;
+
+	alloc->last = 0;
+	alloc->top = 0;
+	alloc->max = num;
+	alloc->mask = mask;
+	spin_lock_init(&alloc->lock);
+	alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long),
+			       GFP_KERNEL);
+	if (!alloc->table)
+		return -ENOMEM;
+
+	bitmap_zero(alloc->table, num);
+	for (i = 0; i < reserved; ++i)
+		set_bit(i, alloc->table);
+
+	return 0;
+}
+
+void mthca_alloc_cleanup(struct mthca_alloc *alloc)
+{
+	kfree(alloc->table);
+}
+
+/*
+ * Array of pointers with lazy allocation of leaf pages.  Callers of
+ * _get, _set and _clear methods must use a lock or otherwise
+ * serialize access to the array.
+ */
+
+void *mthca_array_get(struct mthca_array *array, int index)
+{
+	int p = (index * sizeof (void *)) >> PAGE_SHIFT;
+
+	if (array->page_list[p].page) {
+		int i = index & (PAGE_SIZE / sizeof (void *) - 1);
+		return array->page_list[p].page[i];
+	} else
+		return NULL;
+}
+
+int mthca_array_set(struct mthca_array *array, int index, void *value)
+{
+	int p = (index * sizeof (void *)) >> PAGE_SHIFT;
+
+	/* Allocate with GFP_ATOMIC because we'll be called with locks held.
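+	 * A GFP_KERNEL allocation may sleep, which is not allowed while
+	 * holding a spinlock; GFP_ATOMIC never sleeps, at the price of
+	 * being more likely to fail -- hence the -ENOMEM fallback below.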
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-18 10:51:40.653058284 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if 
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-18 10:51:40.677054757 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
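 * (The #if 0 block below shows what PRM-style timeouts would look
 * like: time classes A/B/C work out to the PRM's roughly 1 ms, 10 ms
 * and 100 ms, rounded up to whole jiffies plus one.)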
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
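 * For example, mthca_MGID_HASH() below uses this to retrieve the
 * 16-bit multicast GID hash that the MGID_HASH firmware command
 * returns as an immediate output parameter.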
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
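+		 * E.g. a 64 KB chunk at DMA address 0x12340000 gives
+		 * ffs(0x12340000 | 0x10000) - 1 = 16, so it is passed to
+		 * the firmware as one aligned 64 KB page (encoded as
+		 * lg - 12 = 4 in the low bits of the entry); anything
+		 * smaller than 4 KB (lg < 12) is rejected with -EINVAL.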
+		 */
+		lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1;
+		if (lg < 12) {
+			mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n",
+				   (unsigned long long) sg_dma_address(sglist + i),
+				   sg_dma_len(sglist + i));
+			err = -EINVAL;
+			goto out;
+		}
+		for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) {
+			*((__be64 *) (inbox + nent * 4 + 2)) =
+				cpu_to_be64((sg_dma_address(sglist + i) +
+					     (j << lg)) |
+					    (lg - 12));
+			ts += 1 << (lg - 10);
+			if (nent == PAGE_SIZE / 16) {
+				err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+						CMD_TIME_CLASS_B, status);
+				if (err || *status)
+					goto out;
+				nent = 0;
+			}
+		}
+	}
+
+	if (nent) {
+		err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+				CMD_TIME_CLASS_B, status);
+	}
+
+	mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts);
+
+out:
+	pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma);
+	return err;
+}
+
+int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status);
+}
+
+int mthca_RUN_FW(struct mthca_dev *dev, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status);
+}
+
+int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status)
+{
+	u32 *outbox;
+	dma_addr_t outdma;
+	int err = 0;
+	u8 lg;
+
+#define QUERY_FW_OUT_SIZE 0x100
+#define QUERY_FW_VER_OFFSET 0x00
+#define QUERY_FW_MAX_CMD_OFFSET 0x0f
+#define QUERY_FW_ERR_START_OFFSET 0x30
+#define QUERY_FW_ERR_SIZE_OFFSET 0x38
+
+#define QUERY_FW_START_OFFSET 0x20
+#define QUERY_FW_END_OFFSET 0x28
+
+#define QUERY_FW_SIZE_OFFSET 0x00
+#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20
+#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40
+#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48
+
+	outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma);
+	if (!outbox) {
+		return -ENOMEM;
+	}
+
+	err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW,
+			    CMD_TIME_CLASS_A, status);
+
+	if (err)
+		goto out;
+
+	MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET);
+	/*
+	 * FW subminor version is at more significant bits than minor
+	 * version, so swap here.
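+	 * E.g. raw 0x000300020001 (major 3, subminor 2, minor 1)
+	 * becomes 0x000300010002, so the %012llx debug print below
+	 * reads as major.minor.subminor.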
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
+	pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE);
+	return err;
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h	2004-11-18 10:51:40.700051376 -0800
@@ -0,0 +1,260 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $
+ */
+
+#ifndef MTHCA_CMD_H
+#define MTHCA_CMD_H
+
+#include
+
+#define MTHCA_CMD_MAILBOX_ALIGN 16UL
+#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1)
+
+enum {
+	/* Command completed successfully: */
+	MTHCA_CMD_STAT_OK = 0x00,
+	/* Internal error (such as a bus error) occurred while processing command: */
+	MTHCA_CMD_STAT_INTERNAL_ERR = 0x01,
+	/* Operation/command not supported or opcode modifier not supported: */
+	MTHCA_CMD_STAT_BAD_OP = 0x02,
+	/* Parameter not supported or parameter out of range: */
+	MTHCA_CMD_STAT_BAD_PARAM = 0x03,
+	/* System not enabled or bad system state: */
+	MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04,
+	/* Attempt to access reserved or unallocated resource: */
+	MTHCA_CMD_STAT_BAD_RESOURCE = 0x05,
+	/* Requested resource is currently executing a command, or is otherwise busy: */
+	MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06,
+	/* DDR memory error: */
+	MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07,
+	/* Required capability exceeds device limits: */
+	MTHCA_CMD_STAT_EXCEED_LIM = 0x08,
+	/* Resource is not in the appropriate state or ownership: */
+	MTHCA_CMD_STAT_BAD_RES_STATE = 0x09,
+	/* Index out of range: */
+	MTHCA_CMD_STAT_BAD_INDEX = 0x0a,
+	/* FW image corrupted: */
+	MTHCA_CMD_STAT_BAD_NVMEM = 0x0b,
+	/* Attempt to modify a QP/EE which is not in the presumed state: */
+	MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10,
+	/* Bad segment parameters (Address/Size): */
+	MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20,
+	/* Memory Region has Memory Windows bound to it: */
+	MTHCA_CMD_STAT_REG_BOUND = 0x21,
+	/* HCA local attached memory not present: */
+	MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22,
+	/* Bad management packet (silently discarded): */
+	MTHCA_CMD_STAT_BAD_PKT = 0x30,
+	/* More outstanding CQEs in CQ than new CQ size: */
+	MTHCA_CMD_STAT_BAD_SIZE = 0x40
+};
+
+enum {
+	MTHCA_TRANS_INVALID = 0,
+	MTHCA_TRANS_RST2INIT,
+	MTHCA_TRANS_INIT2INIT,
+	MTHCA_TRANS_INIT2RTR,
+	MTHCA_TRANS_RTR2RTS,
+	MTHCA_TRANS_RTS2RTS,
+	MTHCA_TRANS_SQERR2RTS,
+	MTHCA_TRANS_ANY2ERR,
+	MTHCA_TRANS_RTS2SQD,
+	MTHCA_TRANS_SQD2SQD,
+	MTHCA_TRANS_SQD2RTS,
+	MTHCA_TRANS_ANY2RST,
+};
+
+enum {
+	DEV_LIM_FLAG_SRQ = 1 << 6
+};
+
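+/*
+ * Every command wrapper declared below returns a kernel error code
+ * (could the command be posted and completed?) and fills in *status
+ * with one of the MTHCA_CMD_STAT_* values above (did the firmware
+ * accept it?).  Callers must check both; the usual pattern, sketched
+ * here for illustration only, is:
+ *
+ *	u8 status;
+ *	int err = mthca_SYS_EN(dev, &status);
+ *
+ *	if (err)
+ *		return err;
+ *	if (status)
+ *		return -EINVAL;
+ */
+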
+struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) x, MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-11-18 10:51:40.724047849 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-11-18 10:51:40.747044469 -0800 @@ -0,0 +1,821 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +} __attribute__((packed)); + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & 
+ get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
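+	 * All other fields of the work completion (byte count,
+	 * immediate data, address information) are left untouched and
+	 * must not be relied on when the status indicates an error.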
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
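+	 * "Updating" means rewriting this CQE in place as a flush
+	 * error pointing at the next WQE in the chain and charging the
+	 * doorbell count, so a single hardware CQE can report the
+	 * flushing of several software work requests.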
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
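+	/*
+	 * At this point the CQ buffer (either one direct allocation or
+	 * npages individually allocated pages) is registered as a
+	 * memory region, so the HCA can reach the whole queue through
+	 * cq->mr.ibmr.lkey regardless of how it is laid out in host
+	 * memory.
+	 */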
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-18 10:51:40.770041089 -0800 @@ -0,0 +1,386 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + 
u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-18 10:51:40.794037561 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. 
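+ * The atomicity matters because a doorbell is one logical 64-bit
+ * command: if two CPUs could interleave the 32-bit halves of
+ * different doorbells, the HCA might latch a mix of the two.  That is
+ * why the fallback at the bottom of this file wraps its pair of
+ * __raw_writel() calls in a spinlock.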
+ */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-18 10:51:40.818034034 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
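+	/*
+	 * SW2HW_EQ hands ownership of the EQ context over to the
+	 * hardware, so from here on the EQ is live; all that remains
+	 * is freeing the temporary buffers, initializing the consumer
+	 * index and requesting the first event notification.
+	 */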
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-18 10:51:40.841030654 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = pci_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
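+ * Because of this we can look up dev->sm_ah and post the send + * while holding sm_lock, without taking a reference on the AH.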
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-18 10:51:40.864027274 -0800 @@ -0,0 +1,889 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM 
returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 
0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar2: + pci_release_region(pdev, 2); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
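+ * Resetting first means every step below starts from a known, + * clean device state.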
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-18 10:51:40.888023746 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. 
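+ * ("Not found" is reported through *index, not the return value.)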
+ * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * If GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. + */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + 
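/* Link the new AMGM entry into the hash chain by pointing the + * previous entry's next_gid_index at the slot just filled in. */ +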
mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-18 10:51:40.917019484 -0800 @@ -0,0 +1,389 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = 
cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + 
mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-18 10:51:40.940016104 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-11-18 10:51:40.964012577 -0800 @@ -0,0 +1,222 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
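+ * MTHCA_RES_NUM is small, so a simple bubble sort is sufficient.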
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
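The sorting comment above relies on a property worth making explicit:
when every size is a power of two and resources are placed largest
first, each running offset is automatically aligned to the next
resource's size, so the layout loop above never needs padding. A worked
example with three hypothetical sizes:

        1 MB resource   -> start 0x000000
        256 KB resource -> start 0x100000  (a multiple of 256 KB)
        8 KB resource   -> start 0x140000  (a multiple of 8 KB)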
+ */
+        dev->limits.num_pds = MTHCA_NUM_PDS;
+
+        kfree(profile);
+        return 0;
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h	2004-11-18 10:51:40.989008903 -0800
@@ -0,0 +1,58 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $
+ */
+
+#ifndef MTHCA_PROFILE_H
+#define MTHCA_PROFILE_H
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+        MTHCA_RES_QP,
+        MTHCA_RES_EEC,
+        MTHCA_RES_SRQ,
+        MTHCA_RES_CQ,
+        MTHCA_RES_EQP,
+        MTHCA_RES_EEEC,
+        MTHCA_RES_EQ,
+        MTHCA_RES_RDB,
+        MTHCA_RES_MCG,
+        MTHCA_RES_MPT,
+        MTHCA_RES_MTT,
+        MTHCA_RES_UAR,
+        MTHCA_RES_UDAV,
+        MTHCA_RES_NUM
+};
+
+int mthca_make_profile(struct mthca_dev *mdev,
+                       struct mthca_dev_lim *dev_lim,
+                       struct mthca_init_hca_param *init_hca);
+
+#endif /* MTHCA_PROFILE_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c	2004-11-18 10:51:41.916872516 -0800
@@ -0,0 +1,629 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
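mthca_query_device and mthca_query_port above, and the pkey/gid queries
that follow, all repeat the same allocate/fill/execute boilerplate
around mthca_MAD_IFC. One possible factoring, sketched here with a
hypothetical helper name that is not in the patch:

        static int mthca_query_smp(struct mthca_dev *mdev, u8 port,
                                   u16 attr_id, u32 attr_mod,
                                   struct ib_mad *out_mad)
        {
                struct ib_mad *in_mad;
                u8 status;
                int err;

                in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL);
                if (!in_mad)
                        return -ENOMEM;

                /* Build a SubnGet() MAD for the requested attribute. */
                memset(in_mad, 0, sizeof *in_mad);
                in_mad->mad_hdr.base_version  = 1;
                in_mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_SUBN_LID_ROUTED;
                in_mad->mad_hdr.class_version = 1;
                in_mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
                in_mad->mad_hdr.attr_id       = cpu_to_be16(attr_id);
                in_mad->mad_hdr.attr_mod      = cpu_to_be32(attr_mod);

                err = mthca_MAD_IFC(mdev, 1, port, in_mad, out_mad, &status);
                if (!err && status)
                        err = -EINVAL;  /* MAD-level failure */

                kfree(in_mad);
                return err;
        }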
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
+        case TAVOR:        return sprintf(buf, "MT23108\n");
+        case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n");
+        case ARBEL_NATIVE: return sprintf(buf, "MT25208\n");
+        default:           return sprintf(buf, "unknown\n");
+        }
+}
+
+static CLASS_DEVICE_ATTR(hw_rev,   S_IRUGO, show_rev,    NULL);
+static CLASS_DEVICE_ATTR(fw_ver,   S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca,    NULL);
+
+static struct class_device_attribute *mthca_class_attributes[] = {
+        &class_device_attr_hw_rev,
+        &class_device_attr_fw_ver,
+        &class_device_attr_hca_type
+};
+
+int mthca_register_device(struct mthca_dev *dev)
+{
+        int ret;
+        int i;
+
+        strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX);
+        dev->ib_dev.node_type     = IB_NODE_CA;
+        dev->ib_dev.phys_port_cnt = dev->limits.num_ports;
+        dev->ib_dev.dma_device    = dev->pdev;
+        dev->ib_dev.class_dev.dev = &dev->pdev->dev;
+        dev->ib_dev.query_device  = mthca_query_device;
+        dev->ib_dev.query_port    = mthca_query_port;
+        dev->ib_dev.modify_port   = mthca_modify_port;
+        dev->ib_dev.query_pkey    = mthca_query_pkey;
+        dev->ib_dev.query_gid     = mthca_query_gid;
+        dev->ib_dev.alloc_pd      = mthca_alloc_pd;
+        dev->ib_dev.dealloc_pd    = mthca_dealloc_pd;
+        dev->ib_dev.create_ah     = mthca_ah_create;
+        dev->ib_dev.destroy_ah    = mthca_ah_destroy;
+        dev->ib_dev.create_qp     = mthca_create_qp;
+        dev->ib_dev.modify_qp     = mthca_modify_qp;
+        dev->ib_dev.destroy_qp    = mthca_destroy_qp;
+        dev->ib_dev.post_send     = mthca_post_send;
+        dev->ib_dev.post_recv     = mthca_post_receive;
+        dev->ib_dev.create_cq     = mthca_create_cq;
+        dev->ib_dev.destroy_cq    = mthca_destroy_cq;
+        dev->ib_dev.poll_cq       = mthca_poll_cq;
+        dev->ib_dev.req_notify_cq = mthca_req_notify_cq;
+        dev->ib_dev.get_dma_mr    = mthca_get_dma_mr;
+        dev->ib_dev.reg_phys_mr   = mthca_reg_phys_mr;
+        dev->ib_dev.dereg_mr      = mthca_dereg_mr;
+        dev->ib_dev.attach_mcast  = mthca_multicast_attach;
+        dev->ib_dev.detach_mcast  = mthca_multicast_detach;
+        dev->ib_dev.process_mad   = mthca_process_mad;
+
+        ret = ib_register_device(&dev->ib_dev);
+        if (ret)
+                return ret;
+
+        for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) {
+                ret = class_device_create_file(&dev->ib_dev.class_dev,
+                                               mthca_class_attributes[i]);
+                if (ret) {
+                        ib_unregister_device(&dev->ib_dev);
+                        return ret;
+                }
+        }
+
+        return 0;
+}
+
+void mthca_unregister_device(struct mthca_dev *dev)
+{
+        ib_unregister_device(&dev->ib_dev);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h	2004-11-18 10:51:41.940868988 -0800
@@ -0,0 +1,221 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $
+ */
+
+#ifndef MTHCA_PROVIDER_H
+#define MTHCA_PROVIDER_H
+
+#include <ib_verbs.h>
+#include <ib_pack.h>
+
+#define MTHCA_MPT_FLAG_ATOMIC        (1 << 14)
+#define MTHCA_MPT_FLAG_REMOTE_WRITE  (1 << 13)
+#define MTHCA_MPT_FLAG_REMOTE_READ   (1 << 12)
+#define MTHCA_MPT_FLAG_LOCAL_WRITE   (1 << 11)
+#define MTHCA_MPT_FLAG_LOCAL_READ    (1 << 10)
+
+struct mthca_buf_list {
+        void *buf;
+        DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+struct mthca_mr {
+        struct ib_mr ibmr;
+        int          order;
+        u32          first_seg;
+};
+
+struct mthca_pd {
+        struct ib_pd    ibpd;
+        u32             pd_num;
+        atomic_t        sqp_count;
+        struct mthca_mr ntmr;
+};
+
+struct mthca_eq {
+        struct mthca_dev      *dev;
+        int                    eqn;
+        u32                    ecr_mask;
+        u16                    msi_x_vector;
+        u16                    msi_x_entry;
+        int                    have_irq;
+        int                    nent;
+        int                    cons_index;
+        struct mthca_buf_list *page_list;
+        struct mthca_mr        mr;
+};
+
+struct mthca_av;
+
+struct mthca_ah {
+        struct ib_ah     ibah;
+        int              on_hca;
+        u32              key;
+        struct mthca_av *av;
+        dma_addr_t       avdma;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table. Each
+ * struct mthca_cq/qp also has its own lock. An individual qp lock
+ * may be taken inside of an individual cq lock. Both cqs attached to
+ * a qp may be locked, with the send cq locked first. No other
+ * nesting should be done.
+ *
+ * Each struct mthca_cq/qp also has an atomic_t ref count. The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference also is good for access through the consumer API, so
+ * modifying the CQ/QP etc doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed. Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-11-18 10:51:41.963865608 -0800 @@ -0,0 +1,1485 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_qp.c 1227 2004-11-13 22:31:53Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + 
[UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; 
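To make the transition table concrete: for an RC QP moving INIT->RTR,
the entries above say a caller's attr_mask must contain at least the
bits below, may add the listed optional ones, and (as checked in
mthca_modify_qp further down) must contain nothing else:

        /* required for RC INIT->RTR, straight from state_table */
        int mask = IB_QP_STATE              |
                   IB_QP_AV                 |
                   IB_QP_PATH_MTU           |
                   IB_QP_DEST_QPN           |
                   IB_QP_RQ_PSN             |
                   IB_QP_MAX_DEST_RD_ATOMIC |
                   IB_QP_MIN_RNR_TIMER;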
+ if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + 
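The shifts just above pack two fields into one word: bits [31:29] of
mtu_msgmax carry the path MTU enum and bits [28:24] carry log2 of the
maximum message size. Reading the two branches back:

        /* UD/MLX QPs: fixed MTU 2048, max message 2^11 = 2 KB */
        mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | (11 << 24));
        /* connected QPs: caller's MTU, max message 2^31 bytes */
        mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | (31 << 24));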
qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, 
"modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + 
+ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = pci_alloc_consistent(dev->pdev, sqp->header_buf_size, + &sqp->header_dma); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + pci_free_consistent(dev->pdev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + 
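The mqpn arithmetic in mthca_alloc_sqp above (qpn * 2 + sqp_start +
port - 1) interleaves the special QPs of the two ports; writing S for
dev->qp_table.sqp_start:

        QP0, port 1 -> S                QP0, port 2 -> S + 1
        QP1, port 1 -> S + 2            QP1, port 2 -> S + 3

which is exactly the four-QP window that is_sqp() tests and the two-QP
window that is_qp0() tests near the top of the file.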
spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + pci_free_consistent(dev->pdev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-11-18 10:51:41.988861934 -0800 @@ -0,0 +1,228 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. 
*/ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Thu Nov 18 10:58:28 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:28 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver Message-ID: <200411181058.E12z7tsbPyqcrIIc@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver Date: Thu, 18 Nov 2004 10:58:28 -0800 Size: 100897 URL: From roland at topspin.com Thu Nov 18 10:58:34 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:34 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][7/12] Add InfiniBand userspace MAD support Message-ID: <200411181058.kKqIomD9Cag8nD8w@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][7/12] Add InfiniBand userspace MAD support Date: Thu, 18 Nov 2004 10:58:34 -0800 Size: 23295 URL: From roland at topspin.com Thu Nov 18 10:58:40 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:40 -0800 Subject: [openib-general] [PATCH][RFC/v1][8/12] Document InfiniBand ioctl use In-Reply-To: <200411181058.kKqIomD9Cag8nD8w@topspin.com> Message-ID: <200411181058.GB144BQ0h6EKgdPS@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. 
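(For readers unfamiliar with ioctl-number.txt: each entry reserves an ioctl "magic" value so command numbers don't collide across subsystems. A sketch of how a 0x1b-based number is typically composed -- the macro names here are illustrative, not quoted from the ib_umad patch:

	#define IB_IOCTL_MAGIC	0x1b	/* the value reserved in the table below */

	/* _IOWR() packs the magic, a command index, and the argument size
	 * into a single ioctl number: */
	#define IB_USER_MAD_REGISTER_AGENT \
		_IOWR(IB_IOCTL_MAGIC, 1, struct ib_user_mad_reg_req)

Registering the magic in Documentation/ioctl-number.txt is what keeps the 0x1b command space from being reused elsewhere.)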
Signed-off-by: Roland Dreier Index: linux-bk/Documentation/ioctl-number.txt =================================================================== --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-17 19:52:39.000000000 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-18 10:51:44.604477463 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Thu Nov 18 10:58:45 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:45 -0800 Subject: [openib-general] [PATCH][RFC/v1][9/12] Add InfiniBand Documentation files In-Reply-To: <200411181058.GB144BQ0h6EKgdPS@topspin.com> Message-ID: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/infiniband/ipoib.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-18 10:51:44.838443072 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net//create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debufs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance. 
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html Index: linux-bk/Documentation/infiniband/sysfs.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-18 10:51:44.866438957 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" Index: linux-bk/Documentation/infiniband/user_mad.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-18 10:51:44.892435136 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). 
The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Thu Nov 18 10:58:50 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:50 -0800 Subject: [openib-general] [PATCH][RFC/v1][10/12] IPoIB IPv4 multicast In-Reply-To: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> Message-ID: <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> Add ip_ib_mc_map() to convert IPv$ multicast addresses to IPoIB hardware addresses. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier Index: linux-bk/include/net/ip.h =================================================================== --- linux-bk.orig/include/net/ip.h 2004-11-17 19:52:25.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-11-18 10:51:45.214387812 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. 
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif Index: linux-bk/net/ipv4/arp.c =================================================================== --- linux-bk.orig/net/ipv4/arp.c 2004-11-17 19:52:34.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-18 10:51:45.214387812 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Thu Nov 18 10:58:56 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:56 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> Message-ID: <200411181058.05X2BGHkQm9UQnbx@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-bk/include/net/if_inet6.h =================================================================== --- linux-bk.orig/include/net/if_inet6.h 2004-11-17 19:52:39.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-18 10:51:45.514343721 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-bk/net/ipv6/addrconf.c =================================================================== --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-17 19:52:35.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-18 10:51:45.515343574 -0800 @@ -1098,6 +1098,13 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ + if (dev->addr_len != 20) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; Index: linux-bk/net/ipv6/ndisc.c =================================================================== --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-17 19:52:19.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-18 10:51:45.516343427 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Thu Nov 18 10:59:01 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:59:01 -0800 Subject: [openib-general] [PATCH][RFC/v1][12/12] InfiniBand MAINTAINERS entry In-Reply-To: <200411181058.05X2BGHkQm9UQnbx@topspin.com> Message-ID: <200411181059.m3Tu2pba6tQT74nE@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier Index: linux-bk/MAINTAINERS =================================================================== --- linux-bk.orig/MAINTAINERS 2004-11-17 19:52:19.000000000 -0800 +++ linux-bk/MAINTAINERS 2004-11-18 10:51:45.861292723 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From roland at topspin.com Thu Nov 18 11:00:14 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:00:14 -0800 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <20041118185305.GQ27658@sventech.com> (Johannes Erdfelt's message of "Thu, 18 Nov 2004 10:53:05 -0800") References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> Message-ID: <52hdnnkmo1.fsf@topspin.com> Johannes> You mean Johannes> char name[sizeof "ib_mad123" + 1]; Johannes> right? :) No, actually sizeof a string includes the trailing nul. Try the following program: int main() { printf("%d\n", (int) sizeof "123"); } I bet it prints "4" :) - R. From roland at topspin.com Thu Nov 18 11:01:37 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:01:37 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support In-Reply-To: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> (Roland Dreier's message of "Thu, 18 Nov 2004 10:58:03 -0800") References: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> Message-ID: <52d5ybkmlq.fsf@topspin.com> Hmm... looks like our spamassassin is a little trigger happy :) - R. From roland at topspin.com Thu Nov 18 11:12:16 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:12:16 -0800 Subject: [openib-general] Re: mthca crash on startup In-Reply-To: <1100803744.3280.11.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 13:49:04 -0500") References: <1100803744.3280.11.camel@localhost.localdomain> Message-ID: <524qjnkm3z.fsf@topspin.com> > modprobe: page allocation failure. order:6, mode:0x20 > [] mthca_alloc_sqp+0x6c/0x420 [ib_mthca] It's not actually a crash. It's just failing to allocate 2048 * 72 bytes of bus-coherent memory (send queue depth time size of a UD header) while creating a special QP. The system should survive this, although of course MAD services won't work. 
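To make the arithmetic concrete (a sketch only; the two sizes are taken from this discussion, not quoted from the driver source):

	/* 2048 send WQEs * 72 bytes of UD header = 147456 bytes = 36 4K pages.
	 * pci_alloc_consistent() gets this from the page allocator, and
	 * get_order(147456) rounds 36 pages up to the next power of two,
	 * 64 pages -- hence the "order:6" in the log above, a fairly large
	 * physically contiguous block to ask for. */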
There are a few things that can be done: - There's no reason mthca needs to allocate all this memory in one physically contiguous chunk, although it makes the code simpler. If this issue persists, we can fix the special QP allocation code (everything else in mthca is pretty good about not requiring contiguous pages). - I seem to recall messages recently on lkml that recent kernels have VM problems that lead to page allocation failures. I think there are some VM tunables and some patches in -mm that are supposed to help. - Having "#define IB_MAD_QP_SEND_SIZE 2048" seems a bit excessive to me. It seems a much shallower send queue should be plenty, especially for QP0. Reducing this will reduce the amount of contiguous memory required, which should improve things. - Roland From peter at pantasys.com Thu Nov 18 11:13:09 2004 From: peter at pantasys.com (Peter Buckingham) Date: Thu, 18 Nov 2004 11:13:09 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <200411181058.05X2BGHkQm9UQnbx@topspin.com> References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> Message-ID: <419CF445.407@pantasys.com> Hi Roland, > Index: linux-bk/net/ipv6/addrconf.c > =================================================================== > --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-17 19:52:35.000000000 -0800 > +++ linux-bk/net/ipv6/addrconf.c 2004-11-18 10:51:45.515343574 -0800 > @@ -1098,6 +1098,13 @@ > memset(eui, 0, 7); > eui[7] = *(u8*)dev->dev_addr; > return 0; > + case ARPHRD_INFINIBAND: > + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ > + if (dev->addr_len != 20) why not make this change to IPOIB_HW_ADDR_LEN now? that's all i've got for now ;-) peter From johannes at erdfelt.com Thu Nov 18 11:18:31 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 18 Nov 2004 11:18:31 -0800 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52hdnnkmo1.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> <52hdnnkmo1.fsf@topspin.com> Message-ID: <20041118191831.GS27658@sventech.com> On Thu, Nov 18, 2004, Roland Dreier wrote: > Johannes> You mean > > Johannes> char name[sizeof "ib_mad123" + 1]; > > Johannes> right? :) > > No, actually sizeof a string includes the trailing nul. Try the > following program: > > int main() { printf("%d\n", (int) sizeof "123"); } > > I bet it prints "4" :) You know, I knew that at one point, but I guess I forgot it for some reason because you're absolutely right. I guess I was thinking strlen instead of sizeof. JE From roland at topspin.com Thu Nov 18 11:20:35 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:20:35 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <419CF445.407@pantasys.com> (Peter Buckingham's message of "Thu, 18 Nov 2004 11:13:09 -0800") References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> <419CF445.407@pantasys.com> Message-ID: <52vfc3j75o.fsf@topspin.com> Peter> why not make this change to IPOIB_HW_ADDR_LEN now? Not a bad idea (although I guess it should be IPOIB_ALEN to match the rest of the kernel). I wonder where to put the value though... is it really worth creating a for this value? - R. 
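(For concreteness, a minimal sketch of what such a header could contain; the file name and macro name below are hypothetical, since the thread has not settled on either:

	/* include/linux/if_infiniband.h -- illustrative only */
	#ifndef _LINUX_IF_INFINIBAND_H
	#define _LINUX_IF_INFINIBAND_H

	/* Hardware address length for ARPHRD_INFINIBAND devices; the
	 * IPoIB patches in this series set dev->addr_len to 20. */
	#define IPOIB_ALEN	20

	#endif /* _LINUX_IF_INFINIBAND_H */

addrconf.c and the IPoIB driver could then compare dev->addr_len against IPOIB_ALEN instead of a bare 20.)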
From halr at voltaire.com Thu Nov 18 11:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 14:22:35 -0500 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52hdnnkmo1.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> <52hdnnkmo1.fsf@topspin.com> Message-ID: <1100805755.3280.15.camel@localhost.localdomain> On Thu, 2004-11-18 at 14:00, Roland Dreier wrote: > Johannes> You mean > > Johannes> char name[sizeof "ib_mad123" + 1]; > > Johannes> right? :) > > No, actually sizeof a string includes the trailing nul. Try the > following program: > > int main() { printf("%d\n", (int) sizeof "123"); } > > I bet it prints "4" :) OK. It's back to just sizeof r.t. sizeof + 1... -- Hal From tduffy at sun.com Thu Nov 18 11:27:51 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 18 Nov 2004 11:27:51 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <52vfc3j75o.fsf@topspin.com> References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> <419CF445.407@pantasys.com> <52vfc3j75o.fsf@topspin.com> Message-ID: <1100806071.21672.7.camel@duffman> On Thu, 2004-11-18 at 11:20 -0800, Roland Dreier wrote: > Peter> why not make this change to IPOIB_HW_ADDR_LEN now? > > Not a bad idea (although I guess it should be IPOIB_ALEN to match the > rest of the kernel). I wonder where to put the value though... is it > really worth creating a for this value? Yeah, I think since everything else has one, ipoib should as well. There are some pretty short files in there like if_cablemodem.h or if_strip.h. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 18 11:35:58 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:35:58 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <1100806071.21672.7.camel@duffman> (Tom Duffy's message of "Thu, 18 Nov 2004 11:27:51 -0800") References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> <419CF445.407@pantasys.com> <52vfc3j75o.fsf@topspin.com> <1100806071.21672.7.camel@duffman> Message-ID: <52r7mrj6g1.fsf@topspin.com> Tom> Yeah, I think since everything else has one, ipoib should as Tom> well. There are some pretty short files in there like Tom> if_cablemodem.h or if_strip.h. Good point. It's pretty hard to beat if_strip.h for brevity... OK, I'll update the patches. Thanks, Roland From halr at voltaire.com Thu Nov 18 11:42:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 14:42:00 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <200411181058.sHj94LsTlhUWv3cp@topspin.com> References: <200411181058.sHj94LsTlhUWv3cp@topspin.com> Message-ID: <1100806919.3280.17.camel@localhost.localdomain> Nit alert... On Thu, 2004-11-18 at 13:58, Roland Dreier wrote: > Content preview: Add support for sending queries to the SA (Subnet > Administrator). ^^^^^^^^^^^^^ Administration). 
From roland at topspin.com Thu Nov 18 11:49:50 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:49:50 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <1100806919.3280.17.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 14:42:00 -0500") References: <200411181058.sHj94LsTlhUWv3cp@topspin.com> <1100806919.3280.17.camel@localhost.localdomain> Message-ID: <52mzxfj5sx.fsf@topspin.com> Hal> Nit alert... Thanks, fixed in my patches. - R. From halr at voltaire.com Thu Nov 18 11:46:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 14:46:05 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200411181058.E12z7tsbPyqcrIIc@topspin.com> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> Message-ID: <1100807164.3280.20.camel@localhost.localdomain> On Thu, 2004-11-18 at 13:58, Roland Dreier wrote: > The ARP/ND implementation for this driver is not completely > straightforward, because InfiniBand requires an additional path lookup > be performed (through an IB-specific mechanism) after a remote > hardware address has been resolved. We are very open to suggestions > of a better way to handle this than the current implementation. Is it also worth pointing out about multicast vis a vis IB ? -- Hal From roland at topspin.com Thu Nov 18 11:50:53 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:50:53 -0800 Subject: [openib-general] Re: mthca crash on startup In-Reply-To: <524qjnkm3z.fsf@topspin.com> (Roland Dreier's message of "Thu, 18 Nov 2004 11:12:16 -0800") References: <1100803744.3280.11.camel@localhost.localdomain> <524qjnkm3z.fsf@topspin.com> Message-ID: <52is83j5r6.fsf@topspin.com> I committed this change, which should help a little (pci_alloc_consistent() is implicitly GFP_ATOMIC). - R. Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 1265) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -967,8 +967,8 @@ u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; - sqp->header_buf = pci_alloc_consistent(dev->pdev, sqp->header_buf_size, - &sqp->header_dma); + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); if (!sqp->header_buf) return -ENOMEM; From roland at topspin.com Thu Nov 18 11:55:33 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:55:33 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1100807164.3280.20.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 14:46:05 -0500") References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> Message-ID: <52ekirj5je.fsf@topspin.com> Hal> Is it also worth pointing out about multicast vis a vis IB ? I didn't think so, because having to join a multicast group with the SA is conceptually similar to having to program a multicast hash table for an ethernet NIC or something like that. So I don't think we have the sort of layering violation that we have for ARP -- the kernel already expects the driver to have to do something driver-specific to handle multicast. 
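(For readers less familiar with the netdev side: the driver-specific hook in question is the 2.6-era set_multicast_list callback on struct net_device. A sketch with hypothetical names, not code from the IPoIB patch:

	/* Called by the core when dev->mc_list or the relevant flags change. */
	static void mydrv_set_multicast_list(struct net_device *dev)
	{
		/* An ethernet driver walks dev->mc_list and reprograms its
		 * hash filter here; an IPoIB driver would instead schedule
		 * SA join/leave requests for the same list. */
	}

	dev->set_multicast_list = mydrv_set_multicast_list;

)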
On the other hand, if you have some suggested verbiage, I'm happy to include it. - R. From halr at voltaire.com Thu Nov 18 12:02:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 15:02:18 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <52ekirj5je.fsf@topspin.com> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> Message-ID: <1100808138.3280.30.camel@localhost.localdomain> On Thu, 2004-11-18 at 14:55, Roland Dreier wrote: > Hal> Is it also worth pointing out about multicast vis a vis IB ? > > I didn't think so, because having to join a multicast group with the > SA is conceptually similar to having to program a multicast hash table > for an ethernet NIC or something like that. So I don't think we have > the sort of layering violation that we have for ARP -- the kernel > already expects the driver to have to do something driver-specific to > handle multicast. Yes and no. While it is the appears the same in terms of the host (and Linux already handles part of the problem, the semantics are not as rich as they need to be for IB. I am referring to knowing whether to join as a send only member, non member, or full member. (I think at least the non member/full member distinction is important; I can live without send only members as this is a minor optimization IMO although it does match the idea of an IP multicast transmitter). > On the other hand, if you have some suggested verbiage, I'm happy to > include it. If you think the idea above is correct, I will craft some verbiage. -- Hal From roland at topspin.com Thu Nov 18 12:11:10 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 12:11:10 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1100808138.3280.30.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 15:02:18 -0500") References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> Message-ID: <523bz6kjdt.fsf@topspin.com> Hal> Yes and no. While it is the appears the same in terms of the Hal> host (and Linux already handles part of the problem, the Hal> semantics are not as rich as they need to be for IB. I am Hal> referring to knowing whether to join as a send only member, Hal> non member, or full member. (I think at least the non Hal> member/full member distinction is important; I can live Hal> without send only members as this is a minor optimization IMO Hal> although it does match the idea of an IP multicast Hal> transmitter). Hal> If you think the idea above is correct, I will craft some Hal> verbiage. Sounds good. - Roland From halr at voltaire.com Thu Nov 18 12:41:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 15:41:13 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <523bz6kjdt.fsf@topspin.com> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> Message-ID: <1100810473.3280.47.camel@localhost.localdomain> On Thu, 2004-11-18 at 15:11, Roland Dreier wrote: > Hal> Yes and no. 
While it is the appears the same in terms of the > Hal> host (and Linux already handles part of the problem, the > Hal> semantics are not as rich as they need to be for IB. I am > Hal> referring to knowing whether to join as a send only member, > Hal> non member, or full member. (I think at least the non > Hal> member/full member distinction is important; I can live > Hal> without send only members as this is a minor optimization IMO > Hal> although it does match the idea of an IP multicast > Hal> transmitter). > > Hal> If you think the idea above is correct, I will craft some > Hal> verbiage. > > Sounds good. Here's a first cut at some working on multicast: Although IB has a special join mode intended to support IP multicast routing (non member), as no means to identify different multicast styles has yet been determined, all joins are currently full member. We are looking for guidance in how to solve this. One more thing: Do we also want to say something about no SM right now ? Or is that putting a cosmic kick me sign on ? -- Hal From tduffy at sun.com Thu Nov 18 12:59:36 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 18 Nov 2004 12:59:36 -0800 Subject: [openib-general] [PATCH][RFC/v1][6/12] Add PoIB (IP-over-InfiniBand) driver In-Reply-To: <1100810473.3280.47.camel@localhost.localdomain> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> Message-ID: <1100811576.21672.31.camel@duffman> On Thu, 2004-11-18 at 15:41 -0500, Hal Rosenstock wrote: > One more thing: > Do we also want to say something about no SM right now ? Or is that > putting a cosmic kick me sign on ? Actually, I thought we agreed that this was necessary to have before we submit to lkml. At least for inclusion. How is somebody going to test the openib code without an SM. (And no, buying a topspin switch is not the answer :-P Nor is using Solaris, or the old gen1 stack) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Nov 18 13:06:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 16:06:26 -0500 Subject: [openib-general] [PATCH][RFC/v1][6/12] Add PoIB (IP-over-InfiniBand) driver In-Reply-To: <1100811576.21672.31.camel@duffman> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> <1100811576.21672.31.camel@duffman> Message-ID: <1100811986.3280.59.camel@localhost.localdomain> On Thu, 2004-11-18 at 15:59, Tom Duffy wrote: > Actually, I thought we agreed that this was necessary to have before we > submit to lkml. At least for inclusion. How is somebody going to test > the openib code without an SM. (And no, buying a topspin switch is not > the answer :-P Nor is using Solaris, or the old gen1 stack) Don't forget Voltaire too... Anyhow, we are within days of starting on this. There are 2 main portions of this: 1. Port to gen2 API 2. Fix build The other aspects can wait if necessary. How long before we need the first part ? Is there any expectation on how long code review would last ? 
Or would they also be running the code ? -- Hal From mshefty at ichips.intel.com Thu Nov 18 13:16:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 18 Nov 2004 13:16:00 -0800 Subject: [openib-general] [PATCH][RFC/v1][6/12] Add PoIB (IP-over-InfiniBand) driver In-Reply-To: <1100811986.3280.59.camel@localhost.localdomain> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> <1100811576.21672.31.camel@duffman> <1100811986.3280.59.camel@localhost.localdomain> Message-ID: <419D1110.2080900@ichips.intel.com> Hal Rosenstock wrote: > Anyhow, we are within days of starting on this. > > There are 2 main portions of this: > 1. Port to gen2 API > 2. Fix build > > The other aspects can wait if necessary. > > How long before we need the first part ? Is there any expectation on how > long code review would last ? Or would they also be running the code ? I would use the first part today if I had it. I wouldn't worry too much about code review right away, since its a user-mode component that wouldn't be included in the kernel. - Sean From roland at topspin.com Thu Nov 18 13:25:44 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 13:25:44 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1100810473.3280.47.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 15:41:13 -0500") References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> Message-ID: <52y8gyj1d3.fsf@topspin.com> Hal> Although IB has a special join mode intended to support IP Hal> multicast routing (non member), as no means to identify Hal> different multicast styles has yet been determined, all joins Hal> are currently full member. We are looking for guidance in how Hal> to solve this. OK, added to to my patches. Hal> One more thing: Do we also want to say something about no SM Hal> right now ? Or is that putting a cosmic kick me sign on ? As far as I know the SM is now purely a userspace issue -- we have enough kernel support to run the SM. It's probably worth mentioning that OpenSM porting still needs to be done (has anyone started?). - Roland From iod00d at hp.com Thu Nov 18 13:48:27 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Nov 2004 13:48:27 -0800 Subject: [openib-general] [PATCH][RFC/v1][10/12] IPoIB IPv4 multicast In-Reply-To: <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> References: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> Message-ID: <20041118214827.GA15892@esmail.cup.hp.com> On Thu, Nov 18, 2004 at 10:58:50AM -0800, Roland Dreier wrote: > Add ip_ib_mc_map() to convert IPv$ multicast addresses to IPoIB > hardware addresses. ... > + addr = ntohl(addr); ... > + buf[19] = addr & 0xff; > + addr >>= 8; > + buf[18] = addr & 0xff; > + addr >>= 8; > + buf[17] = addr & 0xff; > + addr >>= 8; > + buf[16] = addr & 0x0f; Can the same be done instead with the following? addr &= 0x0fffffff; ((unsigned int *)buf)[4] = cpu_to_be32(addr); Or are there possible alignment issues with buf? 
Maybe the following is also correct:

((unsigned int *)buf)[4] = addr & htonl(0x0fffffff);

anyway...just some micro-optimizations...probably really only matters on BE machines.

thanks,
grant

From roland at topspin.com Thu Nov 18 13:53:32 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 13:53:32 -0800
Subject: [openib-general] [PATCH][RFC/v1][10/12] IPoIB IPv4 multicast
In-Reply-To: <20041118214827.GA15892@esmail.cup.hp.com> (Grant Grundler's message of "Thu, 18 Nov 2004 13:48:27 -0800")
References: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> <20041118214827.GA15892@esmail.cup.hp.com>
Message-ID: <52llcyj02r.fsf@topspin.com>

Grant> Can the same be done instead with the following?

I think only your second proposal is correct (since addr is in network byte order). However, the existing ip_eth_mc_map() function in uses the "one byte at a time" method, so I thought we might as well follow existing practice.

- R.

From mlleinin at hpcn.ca.sandia.gov Thu Nov 18 14:25:46 2004
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Thu, 18 Nov 2004 14:25:46 -0800
Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support
In-Reply-To: <52d5ybkmlq.fsf@topspin.com>
References: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> <52d5ybkmlq.fsf@topspin.com>
Message-ID: <1100816746.32165.77.camel@trinity>

On Thu, 2004-11-18 at 11:01 -0800, Roland Dreier wrote:
> Hmm... looks like our spamassassin is a little trigger happy :)

Well since Roland is sending us all spam I can either boot him off the list or increase the spamassassin threshold. :) I decided to increase the threshold to 7.5 (all the IB patches got a spam score of 6.6) so future kernel patches shouldn't be listed as spam.

- Matt

From iod00d at hp.com Thu Nov 18 14:52:28 2004
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 18 Nov 2004 14:52:28 -0800
Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support
In-Reply-To: <1100816746.32165.77.camel@trinity>
References: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> <52d5ybkmlq.fsf@topspin.com> <1100816746.32165.77.camel@trinity>
Message-ID: <20041118225228.GB15892@esmail.cup.hp.com>

On Thu, Nov 18, 2004 at 02:25:46PM -0800, Matt Leininger wrote:
> On Thu, 2004-11-18 at 11:01 -0800, Roland Dreier wrote:
> > Hmm... looks like our spamassassin is a little trigger happy :)
>
> Well since Roland is sending us all spam I can either boot him off the
> list or increase the spamassassin threshold. :) I decided to increase
> the threshold to 7.5 (all the IB patches got a spam score of 6.6) so
> future kernel patches shouldn't be listed as spam.

There's a 3rd choice: adjust the scoring of individual tests the mail triggered so the total score for those emails is < 5.

hth,
grant

From David.Brean at Sun.COM Thu Nov 18 15:56:27 2004
From: David.Brean at Sun.COM (David M. Brean)
Date: Thu, 18 Nov 2004 18:56:27 -0500
Subject: Bonding (was: Re: [openib-general] Re: More on IPoIB Multicast)
In-Reply-To: <1100796136.3277.9.camel@localhost.localdomain>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain>
Message-ID: <419D36AB.3060304@sun.com>

Doesn't the failover mechanism used by the bonding driver move the link layer address from the failed NIC to the standby NIC? If so, the IPoIB link layer address contains a port GUID, so it may be a bit more complex than usual to port bonding over IPoIB.
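(For reference, the 20-byte link-layer address being discussed, reconstructed from the ip_ib_mc_map() and addrconf.c changes earlier in the thread; the struct and its field names are illustrative, not something the patches define:

	struct ipoib_hw_addr {
		u8 qpn[4];	/* reserved byte plus the 24-bit queue pair
				 * number (all-ones for multicast)          */
		u8 gid[16];	/* port GID; bytes 12..19 of the address are
				 * the port GUID that addrconf.c copies into
				 * the IPv6 EUI-64                          */
	};

Because the port GUID is baked into the address, moving that address to a standby port is indeed less straightforward than moving an ethernet MAC.)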
I also thought that the bonding driver would be extended to support ethernet link aggregation (802.3ad) as the load balancing/failover mechanism at some point (I can't find schedule information). This function, too, would not work over IPoIB. -David Hal Rosenstock wrote: > On Thu, 2004-11-18 at 13:41, Nitin Hande wrote: > >> / But before that would like to hear from people about > > />/ various approaches. > / > Some vendors have implemented this by combining multiple HCA ports and > failing over from one to the other. Bonding may provide striping (using > both ports concurrently). > I will need to read up on bonding to understand what it provides and > compare it to what can be done under the IPoIB driver. > > -- Hal > >On Tue, 2004-11-09 at 12:07, Roland Dreier wrote: > > >>multiport bonding/failover >>(although my feeling is that it would be better to extend the existing >>bonding driver rather than trying to put this in the IPoIB driver), .... >> >> > >I'm not clear what the tradeoffs / pros / cons of the two approaches >(use the bonding driver (above the IPoIB driver) or implement it inside >the IPoIB driver) would be. > > > From halr at voltaire.com Fri Nov 19 07:40:01 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 19 Nov 2004 10:40:01 -0500 Subject: [openib-general] OpenIB Thread Usage Message-ID: <1100878801.19061.5.camel@hpc-1> Hi Roland, I noticed that IPoIB uses a single thread whereas the MAD layer uses a thread (per port) per CPU. Could/should IPoIB be multithreaded and would that help performance on multiple processors ? Thanks. -- Hal From roland at topspin.com Fri Nov 19 08:28:33 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:28:33 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <1100878801.19061.5.camel@hpc-1> (Hal Rosenstock's message of "Fri, 19 Nov 2004 10:40:01 -0500") References: <1100878801.19061.5.camel@hpc-1> Message-ID: <524qjliz0u.fsf@topspin.com> Hal> Hi Roland, I noticed that IPoIB uses a single thread whereas Hal> the MAD layer uses a thread (per port) per CPU. Could/should Hal> IPoIB be multithreaded and would that help performance on Hal> multiple processors ? I doubt it. The IPoIB workqueue is used for non-data path stuff like starting multicast joins, which are very far from being CPU-bound. All the data path stuff is run from interrupt context. Of course that's a theoretical argument, and if someone actually measures that changing the create_singlethread_workqueue() to create_workqueue() improves performance, I would have no problem making the change. In fact I'm not sure that having some many MAD workqueue threads isn't overkill that wastes resources, especially on machines with a lot of CPUs. - Roland From roland at topspin.com Fri Nov 19 08:44:26 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:44:26 -0800 Subject: [openib-general] Updated patches coming Message-ID: <52zn1dhjpx.fsf@topspin.com> I'm posting new versions of all the patches. These should incorporate all suggestions from yesterday. Please post any new comments (and let me know if I've screwed up fixing your previous comments). I'll send these patches to linux-kernel (with networking patches cc'ed to netdev) on Monday morning, incorporating any last comments I get by Sunday. Of course I'll also cc openib-general so that everyone here can see the responses without having to sift through linux-kernel. 
Thanks, Roland From roland at topspin.com Fri Nov 19 08:47:46 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:47:46 -0800 Subject: [openib-general] [PATCH][RFC/v2][0/12] Initial submission of InfiniBand patches for review Message-ID: <20041119 847.L6YuhFRk6dxcC9sS@topspin.com> I'm very happy to be able to post an initial version of InfiniBand patches for review. Although this code should be far closer to kernel coding standards than previous open source InfiniBand drivers, this initial posting should be treated as a request for comments and not a request for inclusion; our ultimate goal is to have these drivers included in the mainline kernel, but we expect that fixes and improvements will need to be made before the code is completely acceptable. These patches add a minimal but complete level of InfiniBand support, including an IB midlayer, a low-level driver for Mellanox HCAs, an IP-over-InfiniBand driver, and a mechanism for MADs (management datagrams) to be passed to and from userspace. This means that these patches are all that is required for the kernel to bring up and use an IP-over-InfiniBand link. (The OpenSM subnet manager has not been ported to this kernel API yet, although this work is underway. This means that at the moment, a kernel with these patches cannot be used to bring up a fabric; however, the kernel side is complete.) The code has not been through extreme stress testing yet, but it has been used successfully on i386, x86_64, ppc64, ia64 and sparc64 systems, including mixed 32/64 systems. Feedback on both details of the code as well as the high-level organization of the code will be very much appreciated. For example, the current set of patches puts include files in drivers/infiniband/include; would it be preferred to put include files in include/linux/infiniband/, directly in include/linux, or perhaps in include/infiniband? We would also like to explore the best avenue for having these patches merged. It may be desirable for the patches to spend some time in -mm before moving into Linus's kernel; on the other hand, the patches make only very minimal and safe changes outside of drivers/infiniband, so it is quite reasonable to merge them directly into the mainline kernel. Although 2.6.10 is now closed, 2.6.11 will probably be open by the time the review process is complete. We look forward to the community's comments and criticisms! Thanks, Roland Dreier OpenIB Alliance From roland at topspin.com Fri Nov 19 08:47:52 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:47:52 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][1/12] Add core InfiniBand support Message-ID: <20041119 847.0UsrM0D745D1EXvV@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][1/12] Add core InfiniBand support Date: Fri, 19 Nov 2004 08:47:52 -0800 Size: 120284 URL: From roland at topspin.com Fri Nov 19 08:47:59 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:47:59 -0800 Subject: [openib-general] [PATCH][RFC/v2][2/12] Hook up drivers/infiniband In-Reply-To: <20041119 847.0UsrM0D745D1EXvV@topspin.com> Message-ID: <20041119 847.Alul4BnW1lXB9SBr@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband.
Signed-off-by: Roland Dreier Index: linux-bk/drivers/Kconfig =================================================================== --- linux-bk.orig/drivers/Kconfig 2004-11-19 08:34:39.892304998 -0800 +++ linux-bk/drivers/Kconfig 2004-11-19 08:36:00.427436899 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu Index: linux-bk/drivers/Makefile =================================================================== --- linux-bk.orig/drivers/Makefile 2004-11-19 08:35:05.292561917 -0800 +++ linux-bk/drivers/Makefile 2004-11-19 08:36:00.428436751 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Fri Nov 19 08:48:04 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:04 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][3/12] Add InfiniBand MAD (management datagram) support Message-ID: <20041119 848.Sx9CmcXJ37MTHJMY@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][3/12] Add InfiniBand MAD (management datagram) support Date: Fri, 19 Nov 2004 08:48:04 -0800 Size: 108322 URL: From roland at topspin.com Fri Nov 19 08:48:17 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:17 -0800 Subject: [openib-general] [PATCH][RFC/v2][5/12] Add Mellanox HCA low-level driver In-Reply-To: <20041119 848.bGZXOMXI6bjJEWQr@topspin.com> Message-ID: <20041119 848.kWwVxIYmeAt15lmS@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.) Signed-off-by: Roland Dreier Index: linux-bk/drivers/infiniband/Kconfig =================================================================== --- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-19 08:35:58.828672505 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-19 08:36:02.081193188 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware.
+source "drivers/infiniband/hw/Kconfig" + endmenu Index: linux-bk/drivers/infiniband/Makefile =================================================================== --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-19 08:35:58.864667201 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-19 08:36:02.056196872 -0800 @@ -1 +1 @@ -obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND) += core/ hw/ Index: linux-bk/drivers/infiniband/hw/Kconfig =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/Kconfig 2004-11-19 08:36:02.124186852 -0800 @@ -0,0 +1 @@ +source "drivers/infiniband/hw/mthca/Kconfig" Index: linux-bk/drivers/infiniband/hw/Makefile =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/Makefile 2004-11-19 08:36:02.158181842 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND_MTHCA) += mthca/ Index: linux-bk/drivers/infiniband/hw/mthca/Kconfig =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-11-19 08:36:02.184178011 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver produce a bunch of debug + messages. Select this is you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. Index: linux-bk/drivers/infiniband/hw/mthca/Makefile =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-11-19 08:36:02.224172118 -0800 @@ -0,0 +1,23 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-objs := \ + mthca_main.o \ + mthca_cmd.o \ + mthca_profile.o \ + mthca_reset.o \ + mthca_allocator.o \ + mthca_eq.o \ + mthca_pd.o \ + mthca_cq.o \ + mthca_mr.o \ + mthca_qp.o \ + mthca_av.o \ + mthca_mcg.o \ + mthca_mad.o \ + mthca_provider.o Index: linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-11-19 08:36:02.277164308 -0800 @@ -0,0 +1,175 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $ + */ + +#include <linux/errno.h> +#include <linux/slab.h> +#include <linux/bitmap.h> + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held.
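+ * (GFP_KERNEL allocations can sleep, which is not allowed while + * holding a spinlock.)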
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-19 08:36:02.312159151 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include <linux/init.h> + +#include <ib_verbs.h> +#include <ib_cache.h> + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-19 08:36:02.355152815 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include <linux/sched.h> +#include <linux/pci.h> +#include <linux/errno.h> +#include <asm/io.h> + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now.
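+ * In practice this means a wedged command can tie up its caller for + * up to a minute before we give up and return -EBUSY.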
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
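+ * (In this file, SYS_EN and MGID_HASH are the commands issued this + * way.)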
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + if (!inbox) + return -ENOMEM; + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size.
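+ * (For example, a 64 KB region at a 4 KB-aligned address must be + * passed as sixteen 4 KB chunks: ffs(addr | size) - 1 is the log2 of + * the largest power of two dividing both.)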
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (++nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more significant bits than minor + * version, so swap here.
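+ * (For example, raw 0x000300020001 (major 3, subminor 2, minor 1) + * becomes 0x000300010002, which prints as FW version 3.1.2.)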
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ?
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-11-19 08:36:02.381148984 -0800 @@ -0,0 +1,260 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocated resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to it: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + 
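+/* Device limits reported by the QUERY_DEV_LIM firmware command; mthca_QUERY_DEV_LIM() fills this in, and the rest of the driver sizes its tables from these values. */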
+struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) x, MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-11-19 08:36:02.406145301 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-11-19 08:36:02.451138670 -0800 @@ -0,0 +1,821 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +} __attribute__((packed)); + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & 
+ get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
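+	/* The CQ buffer, CQN and MR are now set up; build the CQ context and pass ownership of the CQ to the HCA with SW2HW_CQ. */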
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-19 08:36:02.478134692 -0800 @@ -0,0 +1,386 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + 
u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-19 08:36:02.515129240 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. 
+ */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-19 08:36:02.559122757 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE)) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
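+	/* Firmware now owns the EQ; the mailbox and the DMA address list were only needed during setup, so free them. */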
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-19 08:36:02.587118631 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = pci_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
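+ * + * A stricter implementation would hold a reference on the AH (or + * copy it) until the send completes; we can skip that here because, + * as noted above, the hardware is done with the AH once the work + * request has been posted.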
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-19 08:36:02.665107138 -0800 @@ -0,0 +1,889 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM 
returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 
0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar2: + pci_release_region(pdev, 2); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
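+ * + * (Hence the ordering below: mthca_tune_pci() and the first + * firmware commands run only after the reset has succeeded.)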
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map interrupt clear register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-19 08:36:02.691103307 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. 
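+ * + * Conceptually the table is an MGM hash table indexed by the MGID + * hash, plus an AMGM overflow area; colliding entries are chained + * from the hash entry through next_gid_index: + * + * MGM[hash] -> AMGM -> AMGM -> (end of chain)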
+ * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * If GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. + */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + 
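+ /* Link the new AMGM entry into the hash chain: point the previous + * entry's next_gid_index (just re-read above) at the entry we + * filled in, and write it back. */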
mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-19 08:36:02.735096824 -0800 @@ -0,0 +1,389 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = 
cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + 
mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-19 08:36:02.775090930 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-11-19 08:36:02.802086952 -0800 @@ -0,0 +1,222 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
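+ * For example, with resources of 16 MB, 4 MB and 1 MB, the
+ * descending layout is 16 MB @ 0, 4 MB @ 16 MB, 1 MB @ 20 MB:
+ * everything placed earlier is a larger power of 2, so each start
+ * offset is automatically a multiple of the resource's own size.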
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
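+ * (A PD is only a protection domain number handed out by the
+ * allocator set up in mthca_init_pd_table(), so it is capped at
+ * MTHCA_NUM_PDS rather than sized against DDR.)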
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-11-19 08:36:02.826083415 -0800 @@ -0,0 +1,58 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-11-19 08:36:02.865077669 -0800 @@ -0,0 +1,629 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
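+	/*
+	 * Same SubnGet header boilerplate as the NodeInfo and
+	 * PortInfo queries above; only attr_id and attr_mod differ.
+	 * Here attr_mod = index / 32 picks which 32-entry block of
+	 * the P_Key table comes back in out_mad->data.
+	 */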
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = dev->pdev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-11-19 08:36:02.912070743 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $
+ */
+
+#ifndef MTHCA_PROVIDER_H
+#define MTHCA_PROVIDER_H
+
+#include <ib_verbs.h>
+#include <ib_pack.h>
+
+#define MTHCA_MPT_FLAG_ATOMIC        (1 << 14)
+#define MTHCA_MPT_FLAG_REMOTE_WRITE  (1 << 13)
+#define MTHCA_MPT_FLAG_REMOTE_READ   (1 << 12)
+#define MTHCA_MPT_FLAG_LOCAL_WRITE   (1 << 11)
+#define MTHCA_MPT_FLAG_LOCAL_READ    (1 << 10)
+
+struct mthca_buf_list {
+	void *buf;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+struct mthca_mr {
+	struct ib_mr ibmr;
+	int order;
+	u32 first_seg;
+};
+
+struct mthca_pd {
+	struct ib_pd    ibpd;
+	u32             pd_num;
+	atomic_t        sqp_count;
+	struct mthca_mr ntmr;
+};
+
+struct mthca_eq {
+	struct mthca_dev      *dev;
+	int                    eqn;
+	u32                    ecr_mask;
+	u16                    msi_x_vector;
+	u16                    msi_x_entry;
+	int                    have_irq;
+	int                    nent;
+	int                    cons_index;
+	struct mthca_buf_list *page_list;
+	struct mthca_mr        mr;
+};
+
+struct mthca_av;
+
+struct mthca_ah {
+	struct ib_ah     ibah;
+	int              on_hca;
+	u32              key;
+	struct mthca_av *av;
+	dma_addr_t       avdma;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table.  Each
+ * struct mthca_cq/qp also has its own lock.  An individual qp lock
+ * may be taken inside of an individual cq lock.  Both cqs attached to
+ * a qp may be locked, with the send cq locked first.  No other
+ * nesting should be done.
+ *
+ * Each struct mthca_cq/qp also has an atomic_t ref count.  The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference also is good for access through the consumer API, so
+ * modifying the CQ/QP etc doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed.  Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
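+ *
+ * In code form, the completion-event rules above look like this
+ * (the same pattern mthca_qp_event() in mthca_qp.c follows):
+ *
+ *	spin_lock(&dev->qp_table.lock);
+ *	qp = mthca_array_get(&dev->qp_table.qp, qpn);
+ *	if (qp)
+ *		atomic_inc(&qp->refcount);
+ *	spin_unlock(&dev->qp_table.lock);
+ *
+ *	if (!qp)
+ *		return;			/* bogus QP number */
+ *
+ *	... dispatch the event ...
+ *
+ *	if (atomic_dec_and_test(&qp->refcount))
+ *		wake_up(&qp->wait);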
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-11-19 08:36:02.958063966 -0800 @@ -0,0 +1,1485 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_qp.c 1270 2004-11-18 21:47:31Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + 
[UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; 
+ if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + 
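+	/* mtu_msgmax packs the path MTU into bits 31:29 and log2 of
+	 * the maximum message size into bits 28:24: MLX and UD QPs
+	 * are capped at 2048-byte messages (2^11), while the other
+	 * transports advertise the full 2^31 bytes. */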
qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, 
"modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + 
+ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + 
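+	/*
+	 * The destroy recipe from the locking notes in
+	 * mthca_provider.h: clear the qp_table pointer, drop that
+	 * reference, then sleep until every event and poll path has
+	 * dropped its own reference too.
+	 */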
spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->imm = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ?
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-11-19 08:36:03.007056746 -0800 @@ -0,0 +1,228 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $ + */ + +#include <linux/config.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/pci.h> +#include <linux/delay.h> + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID.
*/ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Fri Nov 19 08:48:11 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:11 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][4/12] Add InfiniBand SA (Subnet Administration) query support Message-ID: <20041119 848.bGZXOMXI6bjJEWQr@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][4/12] Add InfiniBand SA (Subnet Administration) query support Date: Fri, 19 Nov 2004 08:48:11 -0800 Size: 32678 URL: From roland at topspin.com Fri Nov 19 08:48:17 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:17 -0800 Subject: [openib-general] [PATCH][RFC/v2][6/12] IPoIB IPv4 multicast In-Reply-To: <20041119 848.kWwVxIYmeAt15lmS@topspin.com> Message-ID: <20041119 848.hDNvGYK1INkrbzum@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add <linux/if_infiniband.h> so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier Index: linux-bk/include/linux/if_infiniband.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-11-19 08:36:05.004762348 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ Index: linux-bk/include/net/ip.h =================================================================== --- linux-bk.orig/include/net/ip.h 2004-11-19 08:34:13.893136297 -0800 +++ linux-bk/include/net/ip.h 2004-11-19 08:36:05.005762200 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif Index: linux-bk/net/ipv4/arp.c =================================================================== --- linux-bk.orig/net/ipv4/arp.c 2004-11-19 08:34:34.281131877 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-19 08:36:05.005762200 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Fri Nov 19 08:48:23 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:23 -0800 Subject: [openib-general] [PATCH][RFC/v2][7/12] IPoIB IPv6 support In-Reply-To: <20041119 848.hDNvGYK1INkrbzum@topspin.com> Message-ID: <20041119 848.XUxwEgSdAfHPje3T@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. 
The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-bk/include/net/if_inet6.h =================================================================== --- linux-bk.orig/include/net/if_inet6.h 2004-11-19 08:34:41.311095920 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-19 08:36:05.345712102 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-bk/net/ipv6/addrconf.c =================================================================== --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-19 08:34:39.555354652 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-19 08:36:05.347711808 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1098,6 +1099,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. */ return; Index: linux-bk/net/ipv6/ndisc.c =================================================================== --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-19 08:34:04.597506114 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-19 08:36:05.348711660 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Fri Nov 19 08:48:28 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:28 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][8/12] Add IPoIB (IP-over-InfiniBand) driver Message-ID: <20041119 848.bjjQhFQkoeJ2U43n@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][8/12] Add IPoIB (IP-over-InfiniBand) driver Date: Fri, 19 Nov 2004 08:48:28 -0800 Size: 101515 URL: From roland at topspin.com Fri Nov 19 08:48:35 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:35 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][9/12] Add InfiniBand userspace MAD support Message-ID: <20041119 848.SV7ZJDa1e8EOaqlB@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... 
From: Roland Dreier Subject: [PATCH][RFC/v2][9/12] Add InfiniBand userspace MAD support Date: Fri, 19 Nov 2004 08:48:35 -0800 Size: 23312 URL: From roland at topspin.com Fri Nov 19 08:48:40 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:40 -0800 Subject: [openib-general] [PATCH][RFC/v2][10/12] Document InfiniBand ioctl use In-Reply-To: <20041119 848.SV7ZJDa1e8EOaqlB@topspin.com> Message-ID: <20041119 848.HEJ0RHrfzfVVBRVp@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/ioctl-number.txt =================================================================== --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-19 08:34:40.240253723 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-19 08:36:07.257430376 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Fri Nov 19 08:48:45 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:45 -0800 Subject: [openib-general] [PATCH][RFC/v2][11/12] Add InfiniBand Documentation files In-Reply-To: <20041119 848.HEJ0RHrfzfVVBRVp@topspin.com> Message-ID: <20041119 848.wcw3ffhLsggHhXp4@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/infiniband/ipoib.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-19 08:36:07.579382931 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<intf name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on.
+ + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance. + +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html Index: linux-bk/Documentation/infiniband/sysfs.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-19 08:36:07.775354051 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" Index: linux-bk/Documentation/infiniband/user_mad.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-19 08:36:07.822347125 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... 
*/ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Fri Nov 19 08:48:51 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:51 -0800 Subject: [openib-general] [PATCH][RFC/v2][12/12] InfiniBand MAINTAINERS entry In-Reply-To: <20041119 848.wcw3ffhLsggHhXp4@topspin.com> Message-ID: <20041119 848.ZGXMNSS7d0T6XA9U@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier Index: linux-bk/MAINTAINERS =================================================================== --- linux-bk.orig/MAINTAINERS 2004-11-19 08:34:04.771480477 -0800 +++ linux-bk/MAINTAINERS 2004-11-19 08:36:08.142299974 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From roland at topspin.com Fri Nov 19 09:04:36 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 09:04:36 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][1/12] Add core InfiniBand support In-Reply-To: <20041119 847.0UsrM0D745D1EXvV@topspin.com> (Roland Dreier's message of "Fri, 19 Nov 2004 08:47:52 -0800") References: <20041119 847.0UsrM0D745D1EXvV@topspin.com> Message-ID: <52hdnlhisb.fsf@topspin.com> This being flagged as spam seems to be a bug in my patch-sending script -- I'm generating an invalid message ID: 1.8 INVALID_MSGID Message-Id is not valid, according to RFC 2822 I'll fix this up before Monday. - R. 
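The fragments quoted in user_mad.txt above can be assembled into one small program. The sketch below is illustrative only: it assumes the ib_user_mad.h definitions from the userspace MAD patch (struct ib_user_mad_reg_req, struct ib_user_mad and the IB_USER_MAD_REGISTER_AGENT ioctl) plus a device node created by the suggested udev rule; any detail not shown in user_mad.txt itself is an assumption here, not a documented interface.

	/*
	 * Sketch: register a MAD agent and wait for one incoming MAD.
	 * Header name and device path are assumptions based on the
	 * patches above.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include "ib_user_mad.h"	/* from the userspace MAD patch */

	int main(void)
	{
		struct ib_user_mad_reg_req req;
		struct ib_user_mad mad;
		int fd, ret;

		/* Port 1 of mthca0, as named by the suggested udev rule */
		fd = open("/dev/infiniband/mthca0/ports/1/mad", O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		memset(&req, 0, sizeof req);
		/* ... fill in the registration request for this agent ... */

		ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
		if (ret) {
			perror("agent register");
			return 1;
		}

		/* Block until a MAD arrives; poll()/select() also work */
		ret = read(fd, &mad, sizeof mad);
		if (ret != sizeof mad)
			perror("read");
		else
			printf("agent %u: MAD from LID 0x%04x, status %u\n",
			       req.id, mad.lid, mad.status);

		close(fd);
		return 0;
	}

Per user_mad.txt, closing the descriptor unregisters any remaining agents, so the explicit unregister ioctl is omitted from the sketch.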
From mshefty at ichips.intel.com Fri Nov 19 09:08:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 09:08:43 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <524qjliz0u.fsf@topspin.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> Message-ID: <419E289B.90408@ichips.intel.com> Roland Dreier wrote: > In fact I'm not sure that having so many MAD workqueue threads isn't > overkill that wastes resources, especially on machines with a lot of > CPUs. I tried to keep the MAD layer from knowing about completion threads to make it easier to change it later. I think once we get to some CM performance testing, we can try adjusting the threading model to give us the best performance and scalability: one per port, one per CPU shared across all ports, one per system, dynamically allocated threads, etc. Right now, I'm not sure what sort of performance hit we'll see by having additional idle threads when MAD traffic is low. - Sean From roland at topspin.com Fri Nov 19 09:20:24 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 09:20:24 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E289B.90408@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Nov 2004 09:08:43 -0800") References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> Message-ID: <52d5y9hi1z.fsf@topspin.com> Sean> I tried to keep the MAD layer from knowing about completion Sean> threads to make it easier to change it later. I think once Sean> we get to some CM performance testing, we can try adjusting Sean> the threading model to give us the best performance and Sean> scalability: one per port, one per CPU shared across all Sean> ports, one per system, dynamically allocated threads, etc. I think the CM ends up needing its own set of workqueues so that it can queue MAD processing along with "time wait" events etc. Also we don't want the CM to block general MAD processing while it waits for things like QP modify. Sean> Right now, I'm not sure what sort of performance hit we'll Sean> see by having additional idle threads when MAD traffic is Sean> low. idle threads have pretty minimal impact beyond the memory they use. However on say a 512 CPU box with 6 HCAs, we would create 6000+ kernel threads, which seems pretty excessive. - R. From mshefty at ichips.intel.com Fri Nov 19 09:36:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 09:36:27 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <52d5y9hi1z.fsf@topspin.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> Message-ID: <419E2F1B.7050804@ichips.intel.com> Roland Dreier wrote: > I think the CM ends up needing its own set of workqueues so that it > can queue MAD processing along with "time wait" events etc. Also we > don't want the CM to block general MAD processing while it waits for > things like QP modify. I thought about this approach, but wasn't sure about taking a context switch. I guess with QP redirection, this wouldn't be an issue though. > idle threads have pretty minimal impact beyond the memory they use. > However on say a 512 CPU box with 6 HCAs, we would create 6000+ kernel > threads, which seems pretty excessive. Wouldn't it still just be one per port, or 12 total?
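For context on the numbers in this exchange: create_workqueue() starts one worker thread per CPU for the queue, so 6 HCAs x 2 ports x 512 CPUs comes to 6,144 threads, while a single-threaded workqueue per port would come to 12. A schematic of the two choices, sketched against the 2.6 workqueue API of the period rather than the actual mad.c code (Hal posts the real one-line change later in this thread):

	#include <linux/kernel.h>
	#include <linux/errno.h>
	#include <linux/workqueue.h>

	/* Illustrative only: the two workqueue creation options under
	 * discussion for a given port's MAD completion handling. */
	static int example_create_mad_wq(int port_num, int one_thread)
	{
		char name[16];
		struct workqueue_struct *wq;

		snprintf(name, sizeof name, "ib_mad%d", port_num);

		if (one_thread)
			/* exactly one worker thread for this port */
			wq = create_singlethread_workqueue(name);
		else
			/* one worker thread per CPU for this port */
			wq = create_workqueue(name);

		return wq ? 0 : -ENOMEM;
	}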
From halr at voltaire.com Fri Nov 19 09:46:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 19 Nov 2004 12:46:38 -0500 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E2F1B.7050804@ichips.intel.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> Message-ID: <1100886398.3002.0.camel@hpc-1> On Fri, 2004-11-19 at 12:36, Sean Hefty wrote: > > idle threads have pretty minimal impact beyond the memory they use. > > However on say a 512 CPU box with 6 HCAs, we would create 6000+ kernel > > threads, which seems pretty excessive. > > Wouldn't it still just be one per port, or 12 total? It's currently 1/port/CPU. I think this can be changed easily. -- Hal From roland at topspin.com Fri Nov 19 09:51:33 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 09:51:33 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E2F1B.7050804@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Nov 2004 09:36:27 -0800") References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> Message-ID: <528y8xhgm2.fsf@topspin.com> Sean> I thought about this approach, but wasn't sure about taking Sean> a context switch. I guess with QP redirection, this Sean> wouldn't be an issue though. I don't think there's a choice. If the CM processes MADs from one queue and time wait expirations from another, it's not possible to prevent the MAD queue from getting arbitrarily far ahead of the time wait queue. This results in QPs never being reaped and eventually the system runs out of memory. - R. From mshefty at ichips.intel.com Fri Nov 19 11:59:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 11:59:11 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <528y8xhgm2.fsf@topspin.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> <528y8xhgm2.fsf@topspin.com> Message-ID: <419E508F.5040603@ichips.intel.com> Roland Dreier wrote: > Sean> I thought about this approach, but wasn't sure about taking > Sean> a context switch. I guess with QP redirection, this > Sean> wouldn't be an issue though. > > I don't think there's a choice. If the CM processes MADs from one queue > and time wait expirations from another, it's not possible to prevent > the MAD queue from getting arbitrarily far ahead of the time wait > queue. This results in QPs never being reaped and eventually the > system runs out of memory. I'm not understanding the issue here. If connections are being made faster than QPs are leaving the time wait state, then the system will eventually run out of resources. But this problem seems somewhat separate from the threading model used to establish connections, unless that thread is preventing other threads from executing. If that's the case, is it worth considering exposing the MAD work queue for use by not just the MAD layer, but also specific clients, such as the CM and SA client code? I would think that as long as the code is structured around using work queues, we should be able to adjust the number of threads per work queue, along with the number of work queues in order to see what combination works best.
- Sean From halr at voltaire.com Fri Nov 19 12:18:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 19 Nov 2004 15:18:45 -0500 Subject: [openib-general] [RFC] [PATCH] mad: Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU Message-ID: <1100895525.4136.11.camel@localhost.localdomain> Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU (Note that I have not applied this but am requesting comments). Index: mad.c =================================================================== --- mad.c (revision 1269) +++ mad.c (working copy) @@ -1900,7 +1900,7 @@ goto error7; snprintf(name, sizeof name, "ib_mad%d", port_num); - port_priv->wq = create_workqueue(name); + port_priv->wq = create_singlethread_workqueue(name); if (!port_priv->wq) { ret = -ENOMEM; goto error8; From mshefty at ichips.intel.com Fri Nov 19 13:25:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 13:25:27 -0800 Subject: [openib-general] [RFC] [PATCH] mad: Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU In-Reply-To: <1100895525.4136.11.camel@localhost.localdomain> References: <1100895525.4136.11.camel@localhost.localdomain> Message-ID: <419E64C7.9030905@ichips.intel.com> Hal Rosenstock wrote: > Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU > (Note that I have not applied this but am requesting comments). > > Index: mad.c > =================================================================== > --- mad.c (revision 1269) > +++ mad.c (working copy) > @@ -1900,7 +1900,7 @@ > goto error7; > > snprintf(name, sizeof name, "ib_mad%d", port_num); > - port_priv->wq = create_workqueue(name); > + port_priv->wq = create_singlethread_workqueue(name); > if (!port_priv->wq) { > ret = -ENOMEM; > goto error8; My guess is that this is probably preferable to having 1/port/CPU, especially on larger systems. It would depend on what the clients do when notified of a completion. I guess one advantage of keeping it 1/port/CPU (for now) is that it would help test multi-threaded support. - Sean From halr at voltaire.com Sun Nov 21 09:17:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 21 Nov 2004 12:17:57 -0500 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <52wtwmk9q8.fsf@topspin.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> Message-ID: <1101057477.4124.7.camel@localhost.localdomain> On Mon, 2004-11-15 at 17:50, Roland Dreier wrote: > Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, > Tom> mthca and ipoib builtin and everything builds and boots fine > Tom> (at least on x86_64). > > Cool, thanks for testing. For what it's worth, it works here on i386 > as well. (Not very convenient for development though :) Thanks for investigating. Here's the combo which breaks the build as follows: LD drivers/infiniband/core/built-in.o LD drivers/infiniband/built-in.o ld: cannot open drivers/infiniband/ulp/built-in.o: No such file or directory make[2]: *** [drivers/infiniband/built-in.o] Error 1 This occurs when you configure Infiniband as built-in and mthca and IPoIB as modular. This sort of combo seems to work for other subsystems like I2O. -- Hal From mst at mellanox.co.il Sun Nov 21 14:06:07 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 22 Nov 2004 00:06:07 +0200 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E508F.5040603@ichips.intel.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> <528y8xhgm2.fsf@topspin.com> <419E508F.5040603@ichips.intel.com> Message-ID: <20041121220607.GF11676@mellanox.co.il> Hello! Quoting r. Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] Re: OpenIB Thread Usage": > Roland Dreier wrote: > > > Sean> I thought about this approach, but wasn't sure about taking > > Sean> a context switch. I guess with QP redirection, this > > Sean> wouldn't be an issue though. > > > >I don't think there's a choice. If the CM processes MADs from one queue > >and time wait expirations from another, it's not possible to prevent > >the MAD queue from getting arbitrarily far ahead of the time wait > >queue. This results in QPs never being reaped and eventually the > >system runs out of memory. > > I'm not understanding the issue here. If connections are being made > faster than QPs are leaving the time wait state, then the system will > eventually run out of resources. But this problem seems somewhat > separate from the threading model used to establish connections, unless > that thread is preventing other threads from executing. The idea I think was that if you start dropping MADs when you can not establish a connection, the remote side will retry. One way to drop MADs is by blocking the MAD work thread. MST From roland at topspin.com Sun Nov 21 20:41:31 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 21 Nov 2004 20:41:31 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <1101057477.4124.7.camel@localhost.localdomain> (Hal Rosenstock's message of "Sun, 21 Nov 2004 12:17:57 -0500") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> <1101057477.4124.7.camel@localhost.localdomain> Message-ID: <52d5y6fqbo.fsf@topspin.com> Hal> Thanks for investigating. Here's the combo which breaks the Hal> build as follows: Hal> This occurs when you configure Infiniband as built-in and Hal> mthca and IPoIB as modular. Hmm, seems like the kernel build system doesn't like going into directories and finding nothing to build. I fixed it by getting rid of ulp/Makefile, ulp/Kconfig, hw/Makefile and hw/Kconfig (and having infiniband/Kconfig and infiniband/Makefile go directly to ulp/ipoib and hw/mthca). With these changes, I built but didn't boot a kernel with IB-core built-in and mthca/ipoib as modules. - R. From roland at topspin.com Sun Nov 21 20:45:45 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 21 Nov 2004 20:45:45 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E508F.5040603@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Nov 2004 11:59:11 -0800") References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> <528y8xhgm2.fsf@topspin.com> <419E508F.5040603@ichips.intel.com> Message-ID: <524qjifq4m.fsf@topspin.com> Sean> I'm not understanding the issue here. If connections are Sean> being made faster than QPs are leaving the time wait state, Sean> then the system will eventually run out of resources.
But Sean> this problem seems somewhat separate from the threading Sean> model used to establish connections, unless that thread is Sean> preventing other threads from executing. Sorry I wasn't clearer. The problem I was trying to describe (which incidentally has been seen with a real application) is that if MADs like CM REQs are processed in one queue, and time wait expirations are processed in a separate queue, then it's possible for the MAD queue + application to starve the time wait queue. This means a larger and larger backlog of time wait expirations accumulates and eventually the system runs out of resources, even though the application only keeps a constant number of QPs in use. Sean> If that's the case, is it worth considering exposing the Sean> MAD work queue for use by not just the MAD layer, but also Sean> specific clients, such as the CM and SA client code? That would be another solution. However it seems reasonable to let clients tune their work processing model to their needs (and avoid having them clog up the MAD queue). - R. From roland at topspin.com Mon Nov 22 07:13:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:24 -0800 Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review Message-ID: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> I'm very happy to be able to post an initial version of InfiniBand patches for review. Although this code should be far closer to kernel coding standards than previous open source InfiniBand drivers, this initial posting should be treated as a request for comments and not a request for inclusion; our ultimate goal is to have these drivers included in the mainline kernel, but we expect that fixes and improvements will need to be made before the code is completely acceptable. These patches add a minimal but complete level of InfiniBand support, including an IB midlayer, a low-level driver for Mellanox HCAs, an IP-over-InfiniBand driver, and a mechanism for MADs (management datagrams) to be passed to and from userspace. This means that these patches are all that is required for the kernel to bring up and use an IP-over-InfiniBand link. (The OpenSM subnet manager has not been ported to this kernel API yet, although this work is underway. This means that at the moment, a kernel with these patches cannot be used to bring up a fabric; however, the kernel side is complete) The code has not been through extreme stress testing yet, but it has been used successfully on i386, x86_64, ppc64, ia64 and sparc64 systems, including mixed 32/64 systems. Feedback on both details of the code as well as the high-level organization of the code will be very much appreciated. For example, the current set of patches puts include files in drivers/infiniband/include; would it be preferred to put include files in include/linux/infiniband/, directly in include/linux, or perhaps in include/infiniband? We would also like to explore the best avenue for having these patches merged. It may be desirable for the patches to spend some time in -mm before moving into Linus's kernel; on the other hand, the patches make only very minimal and safe changes outside of drivers/infiniband, so it is quite reasonable to merge them directly into the mainline kernel. Although 2.6.10 is now closed, 2.6.11 will probably be open by the time the review process is complete. We look forward to the community's comments and criticisms!
Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Mon Nov 22 07:13:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:29 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support Message-ID: <20041122713.TMt4584EVSreQOO2@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][1/12] Add core InfiniBand support Date: Mon, 22 Nov 2004 07:13:29 -0800 Size: 120269 URL: From roland at topspin.com Mon Nov 22 07:13:36 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:36 -0800 Subject: [openib-general] [PATCH][RFC/v1][2/12] Hook up drivers/infiniband In-Reply-To: <20041122713.TMt4584EVSreQOO2@topspin.com> Message-ID: <20041122713.yCm1WiU1XOAxLOWd@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. Signed-off-by: Roland Dreier Index: linux-bk/drivers/Kconfig =================================================================== --- linux-bk.orig/drivers/Kconfig 2004-11-21 21:07:30.646934807 -0800 +++ linux-bk/drivers/Kconfig 2004-11-21 21:25:52.850360262 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu Index: linux-bk/drivers/Makefile =================================================================== --- linux-bk.orig/drivers/Makefile 2004-11-21 21:07:54.491393897 -0800 +++ linux-bk/drivers/Makefile 2004-11-21 21:25:52.850360262 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Mon Nov 22 07:13:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:41 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support Message-ID: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support Date: Mon, 22 Nov 2004 07:13:41 -0800 Size: 108306 URL: From roland at topspin.com Mon Nov 22 07:13:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:48 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support Message-ID: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support Date: Mon, 22 Nov 2004 07:13:48 -0800 Size: 32662 URL: From roland at topspin.com Mon Nov 22 07:13:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:54 -0800 Subject: [openib-general] [PATCH][RFC/v1][5/12] Add Mellanox HCA low-level driver In-Reply-To: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> Message-ID: <20041122713.cSeT4UFKGqJDdZ8T@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. 
The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API) Signed-off-by: Roland Dreier Index: linux-bk/drivers/infiniband/Kconfig =================================================================== --- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-21 21:25:51.525556772 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-21 21:25:54.389132014 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu Index: linux-bk/drivers/infiniband/Makefile =================================================================== --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-21 21:25:51.549553213 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-21 21:25:54.364135721 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ Index: linux-bk/drivers/infiniband/hw/mthca/Kconfig =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-11-21 21:25:54.414128306 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. Index: linux-bk/drivers/infiniband/hw/mthca/Makefile =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-11-21 21:25:54.439124598 -0800 @@ -0,0 +1,23 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-objs := \ + mthca_main.o \ + mthca_cmd.o \ + mthca_profile.o \ + mthca_reset.o \ + mthca_allocator.o \ + mthca_eq.o \ + mthca_pd.o \ + mthca_cq.o \ + mthca_mr.o \ + mthca_qp.o \ + mthca_av.o \ + mthca_mcg.o \ + mthca_mad.o \ + mthca_provider.o Index: linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-11-21 21:25:54.464120890 -0800 @@ -0,0 +1,175 @@ +/* + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
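+ * (GFP_ATOMIC never sleeps, so this allocation can fail under memory
+ * pressure -- hence the -ENOMEM check just below.)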
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-21 21:25:54.489117183 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if 
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-21 21:25:54.517113030 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
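+ * (For reference, with HZ=1000 the disabled PRM-style values below
+ * would come to 2, 11 and 101 jiffies for classes A, B and C --
+ * far too strict if the FW really is starved.)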
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
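+ * In this patch, mthca_SYS_EN() and mthca_MGID_HASH() are the
+ * callers that use this immediate-result path.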
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
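+ * E.g. a 64 KB chunk DMA-mapped at address 0x12340000 gives
+ * lg = ffs(0x12340000 | 0x10000) - 1 = 16, so it is passed to FW
+ * as a single 2^16-byte page, encoded as 0x12340000 | (16 - 12).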
+ */
+ lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1;
+ if (lg < 12) {
+ mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n",
+ (unsigned long long) sg_dma_address(sglist + i),
+ sg_dma_len(sglist + i));
+ err = -EINVAL;
+ goto out;
+ }
+ for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) {
+ *((__be64 *) (inbox + nent * 4 + 2)) =
+ cpu_to_be64((sg_dma_address(sglist + i) +
+ (j << lg)) |
+ (lg - 12));
+ ts += 1 << (lg - 10);
+ if (nent == PAGE_SIZE / 16) {
+ err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+ CMD_TIME_CLASS_B, status);
+ if (err || *status)
+ goto out;
+ nent = 0;
+ }
+ }
+
+ if (nent) {
+ err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+ CMD_TIME_CLASS_B, status);
+ }
+
+ mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts);
+
+out:
+ pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma);
+ return err;
+}
+
+int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status)
+{
+ return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status);
+}
+
+int mthca_RUN_FW(struct mthca_dev *dev, u8 *status)
+{
+ return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status);
+}
+
+int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status)
+{
+ u32 *outbox;
+ dma_addr_t outdma;
+ int err = 0;
+ u8 lg;
+
+#define QUERY_FW_OUT_SIZE 0x100
+#define QUERY_FW_VER_OFFSET 0x00
+#define QUERY_FW_MAX_CMD_OFFSET 0x0f
+#define QUERY_FW_ERR_START_OFFSET 0x30
+#define QUERY_FW_ERR_SIZE_OFFSET 0x38
+
+#define QUERY_FW_START_OFFSET 0x20
+#define QUERY_FW_END_OFFSET 0x28
+
+#define QUERY_FW_SIZE_OFFSET 0x00
+#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20
+#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40
+#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48
+
+ outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma);
+ if (!outbox) {
+ return -ENOMEM;
+ }
+
+ err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW,
+ CMD_TIME_CLASS_A, status);
+
+ if (err)
+ goto out;
+
+ MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET);
+ /*
+ * FW subminor version is at more significant bits than minor
+ * version, so swap here.
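+ * E.g. raw 0x000300020001 (major 3, subminor 2, minor 1)
+ * becomes 0x000300010002, i.e. FW version 3.1.2.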
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE);
+ return err;
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-11-21 21:25:54.543109174 -0800
@@ -0,0 +1,260 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * , or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * .
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $
+ */
+
+#ifndef MTHCA_CMD_H
+#define MTHCA_CMD_H
+
+#include
+
+#define MTHCA_CMD_MAILBOX_ALIGN 16UL
+#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1)
+
+enum {
+ /* command completed successfully: */
+ MTHCA_CMD_STAT_OK = 0x00,
+ /* Internal error (such as a bus error) occurred while processing command: */
+ MTHCA_CMD_STAT_INTERNAL_ERR = 0x01,
+ /* Operation/command not supported or opcode modifier not supported: */
+ MTHCA_CMD_STAT_BAD_OP = 0x02,
+ /* Parameter not supported or parameter out of range: */
+ MTHCA_CMD_STAT_BAD_PARAM = 0x03,
+ /* System not enabled or bad system state: */
+ MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04,
+ /* Attempt to access reserved or unallocated resource: */
+ MTHCA_CMD_STAT_BAD_RESOURCE = 0x05,
+ /* Requested resource is currently executing a command, or is otherwise busy: */
+ MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06,
+ /* memory error: */
+ MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07,
+ /* Required capability exceeds device limits: */
+ MTHCA_CMD_STAT_EXCEED_LIM = 0x08,
+ /* Resource is not in the appropriate state or ownership: */
+ MTHCA_CMD_STAT_BAD_RES_STATE = 0x09,
+ /* Index out of range: */
+ MTHCA_CMD_STAT_BAD_INDEX = 0x0a,
+ /* FW image corrupted: */
+ MTHCA_CMD_STAT_BAD_NVMEM = 0x0b,
+ /* Attempt to modify a QP/EE which is not in the presumed state: */
+ MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10,
+ /* Bad segment parameters (Address/Size): */
+ MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20,
+ /* Memory Region has Memory Windows bound to it: */
+ MTHCA_CMD_STAT_REG_BOUND = 0x21,
+ /* HCA local attached memory not present: */
+ MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22,
+ /* Bad management packet (silently discarded): */
+ MTHCA_CMD_STAT_BAD_PKT = 0x30,
+ /* More outstanding CQEs in CQ than new CQ size: */
+ MTHCA_CMD_STAT_BAD_SIZE = 0x40
+};
+
+enum {
+ MTHCA_TRANS_INVALID = 0,
+ MTHCA_TRANS_RST2INIT,
+ MTHCA_TRANS_INIT2INIT,
+ MTHCA_TRANS_INIT2RTR,
+ MTHCA_TRANS_RTR2RTS,
+ MTHCA_TRANS_RTS2RTS,
+ MTHCA_TRANS_SQERR2RTS,
+ MTHCA_TRANS_ANY2ERR,
+ MTHCA_TRANS_RTS2SQD,
+ MTHCA_TRANS_SQD2SQD,
+ MTHCA_TRANS_SQD2RTS,
+ MTHCA_TRANS_ANY2RST,
+};
+
+enum {
+ DEV_LIM_FLAG_SRQ = 1 << 6
+};
+
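+/* Device limits reported by the QUERY_DEV_LIM firmware command. */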
+struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context,
+		   int cq_num, u8 *status);
+int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context,
+		   int cq_num, u8 *status);
+int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num,
+		    int is_ee, void *qp_context, u32 optmask,
+		    u8 *status);
+int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee,
+		   void *qp_context, u8 *status);
+int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn,
+			  u8 *status);
+int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port,
+		  void *in_mad, void *response_mad, u8 *status);
+int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm,
+		   u8 *status);
+int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm,
+		    u8 *status);
+int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash,
+		    u8 *status);
+
+#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN))
+
+#endif /* MTHCA_CMD_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h	2004-11-21 21:25:54.567105615 -0800
@@ -0,0 +1,51 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $
+ */
+
+#ifndef MTHCA_CONFIG_REG_H
+#define MTHCA_CONFIG_REG_H
+
+#include <asm/page.h>
+
+#define MTHCA_HCR_BASE		0x80680
+#define MTHCA_HCR_SIZE		0x0001c
+#define MTHCA_ECR_BASE		0x80700
+#define MTHCA_ECR_SIZE		0x00008
+#define MTHCA_ECR_CLR_BASE	0x80708
+#define MTHCA_ECR_CLR_SIZE	0x00008
+#define MTHCA_ECR_OFFSET	(MTHCA_ECR_BASE     - MTHCA_HCR_BASE)
+#define MTHCA_ECR_CLR_OFFSET	(MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE)
+#define MTHCA_CLR_INT_BASE	0xf00d8
+#define MTHCA_CLR_INT_SIZE	0x00008
+
+#define MTHCA_MAP_HCR_SIZE	(MTHCA_ECR_CLR_BASE +	\
+				 MTHCA_ECR_CLR_SIZE -	\
+				 MTHCA_HCR_BASE)
+
+#endif /* MTHCA_CONFIG_REG_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c	2004-11-21 21:25:54.594101610 -0800
@@ -0,0 +1,821 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $
+ */
+
+#include <linux/init.h>
+
+#include <ib_pack.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE
+};
+
+enum {
+	MTHCA_CQ_ENTRY_SIZE = 0x20
+};
+
+struct mthca_cq_context {
+	u32 flags;
+	u64 start;
+	u32 logsize_usrpage;
+	u32 error_eqn;
+	u32 comp_eqn;
+	u32 pd;
+	u32 lkey;
+	u32 last_notified_index;
+	u32 solicit_producer_index;
+	u32 consumer_index;
+	u32 producer_index;
+	u32 cqn;
+	u32 reserved[3];
+} __attribute__((packed));
+
+#define MTHCA_CQ_STATUS_OK          ( 0 << 28)
+#define MTHCA_CQ_STATUS_OVERFLOW    ( 9 << 28)
+#define MTHCA_CQ_STATUS_WRITE_FAIL  (10 << 28)
+#define MTHCA_CQ_FLAG_TR            ( 1 << 18)
+#define MTHCA_CQ_FLAG_OI            ( 1 << 17)
+#define MTHCA_CQ_STATE_DISARMED     ( 0 << 8)
+#define MTHCA_CQ_STATE_ARMED        ( 1 << 8)
+#define MTHCA_CQ_STATE_ARMED_SOL    ( 4 << 8)
+#define MTHCA_EQ_STATE_FIRED        (10 << 8)
+
+enum {
+	MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe
+};
+
+enum {
+	SYNDROME_LOCAL_LENGTH_ERR        = 0x01,
+	SYNDROME_LOCAL_QP_OP_ERR         = 0x02,
+	SYNDROME_LOCAL_EEC_OP_ERR        = 0x03,
+	SYNDROME_LOCAL_PROT_ERR          = 0x04,
+	SYNDROME_WR_FLUSH_ERR            = 0x05,
+	SYNDROME_MW_BIND_ERR             = 0x06,
+	SYNDROME_BAD_RESP_ERR            = 0x10,
+	SYNDROME_LOCAL_ACCESS_ERR        = 0x11,
+	SYNDROME_REMOTE_INVAL_REQ_ERR    = 0x12,
+	SYNDROME_REMOTE_ACCESS_ERR       = 0x13,
+	SYNDROME_REMOTE_OP_ERR           = 0x14,
+	SYNDROME_RETRY_EXC_ERR           = 0x15,
+	SYNDROME_RNR_RETRY_EXC_ERR       = 0x16,
+	SYNDROME_LOCAL_RDD_VIOL_ERR      = 0x20,
+	SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21,
+	SYNDROME_REMOTE_ABORTED_ERR      = 0x22,
+	SYNDROME_INVAL_EECN_ERR          = 0x23,
+	SYNDROME_INVAL_EEC_STATE_ERR     = 0x24
+};
+
+struct mthca_cqe {
+	u32 my_qpn;
+	u32 my_ee;
+	u32 rqpn;
+	u16 sl_g_mlpath;
+	u16 rlid;
+	u32 imm_etype_pkey_eec;
+	u32 byte_cnt;
+	u32 wqe;
+	u8  opcode;
+	u8  is_send;
+	u8  reserved;
+	u8  owner;
+} __attribute__((packed));
+
+struct mthca_err_cqe {
+	u32 my_qpn;
+	u32 reserved1[3];
+	u8  syndrome;
+	u8  reserved2;
+	u16 db_cnt;
+	u32 reserved3;
+	u32 wqe;
+	u8  opcode;
+	u8  reserved4[2];
+	u8  owner;
+} __attribute__((packed));
+
+#define MTHCA_CQ_ENTRY_OWNER_SW      (0 << 7)
+#define MTHCA_CQ_ENTRY_OWNER_HW      (1 << 7)
+
+#define MTHCA_CQ_DB_INC_CI       (1 << 24)
+#define MTHCA_CQ_DB_REQ_NOT      (2 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_SOL  (3 << 24)
+#define MTHCA_CQ_DB_SET_CI       (4 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24)
+
+static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry)
+{
+	if (cq->is_direct)
+		return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE);
+	else
+		return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf
+			+ (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE;
+}
+
+static inline int cqe_sw(struct mthca_cq *cq, int i)
+{
+	return !(MTHCA_CQ_ENTRY_OWNER_HW &
+ get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-21 21:25:54.619097902 -0800 @@ -0,0 +1,386 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + 
u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-21 21:25:54.644094195 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. 
+ */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-21 21:25:54.670090339 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-21 21:25:54.696086483 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = pci_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-21 21:25:54.722082627 -0800 @@ -0,0 +1,889 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM 
returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status
0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar2: + pci_release_region(pdev, 2); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
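/* Rough probe order from here: enable the PCI device, sanity-check the BARs and request regions, set the DMA masks, reset the HCA, map its registers, bring up the firmware and HCA tables, then register with the IB midlayer and create the MAD agents. */ +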
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
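+ * (mthca_reset() is expected to save and restore PCI config space around the SW reset, so the PCI-X/PCI Express tuning and MSI setup that follow see a sane configuration.)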
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-21 21:25:54.747078919 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. 
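+ * For example, with hash value 0x12 and one AMGM entry chained behind that slot, looking up the chained GID returns *index = the AMGM slot and *prev = 0x12.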
+ * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * If GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. + */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + 
mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-21 21:25:54.772075211 -0800 @@ -0,0 +1,389 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = 
cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + 
mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-21 21:25:54.797071503 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-11-21 21:25:54.822067796 -0800 @@ -0,0 +1,222 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
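+ * For example, power-of-two sizes 32, 8, 8 and 4 packed in sorted order land at offsets 0, 32, 40 and 44; each start stays aligned to its own size with no padding in between.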
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
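+ * (A PD number is only a tag that the HCA checks against MPT and QP entries, so a fixed software pool of 1 << 15 is plenty.)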
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-11-21 21:25:54.847064088 -0800 @@ -0,0 +1,58 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-11-21 21:25:54.873060232 -0800 @@ -0,0 +1,629 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
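+	/*
+	 * Note: this hand-rolled SubnGet SMP setup is repeated in every
+	 * query method in this file.  A sketch of a helper that could
+	 * factor it out (init_smp() is hypothetical, not part of this
+	 * patch):
+	 *
+	 *	static void init_smp(struct ib_mad *mad, u16 attr_id, u32 attr_mod)
+	 *	{
+	 *		memset(mad, 0, sizeof *mad);
+	 *		mad->mad_hdr.base_version  = 1;
+	 *		mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_SUBN_LID_ROUTED;
+	 *		mad->mad_hdr.class_version = 1;
+	 *		mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
+	 *		mad->mad_hdr.attr_id       = cpu_to_be16(attr_id);
+	 *		mad->mad_hdr.attr_mod      = cpu_to_be32(attr_mod);
+	 *	}
+	 */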
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) |
+	       MTHCA_MPT_FLAG_LOCAL_READ;
+}
+
+static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct mthca_mr *mr;
+	int err;
+
+	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	err = mthca_mr_alloc_notrans(to_mdev(pd->device),
+				     to_mpd(pd)->pd_num,
+				     convert_access(acc), mr);
+
+	if (err) {
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+}
+
+static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd,
+				       struct ib_phys_buf *buffer_list,
+				       int num_phys_buf,
+				       int acc,
+				       u64 *iova_start)
+{
+	struct mthca_mr *mr;
+	u64 *page_list;
+	u64 total_size;
+	u64 mask;
+	int shift;
+	int npages;
+	int err;
+	int i, j, n;
+
+	/* First check that we have enough alignment */
+	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK))
+		return ERR_PTR(-EINVAL);
+
+	if (num_phys_buf > 1 &&
+	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK))
+		return ERR_PTR(-EINVAL);
+
+	mask = 0;
+	total_size = 0;
+	for (i = 0; i < num_phys_buf; ++i) {
+		if (buffer_list[i].addr & ~PAGE_MASK)
+			return ERR_PTR(-EINVAL);
+		if (i != 0 && i != num_phys_buf - 1 &&
+		    (buffer_list[i].size & ~PAGE_MASK))
+			return ERR_PTR(-EINVAL);
+
+		total_size += buffer_list[i].size;
+		if (i > 0)
+			mask |= buffer_list[i].addr;
+	}
+
+	/* Find largest page shift we can use to cover buffers */
+	for (shift = PAGE_SHIFT; shift < 31; ++shift)
+		if (num_phys_buf > 1) {
+			if ((1ULL << shift) & mask)
+				break;
+		} else {
+			if (1ULL << shift >=
+			    buffer_list[0].size +
+			    (buffer_list[0].addr & ((1ULL << shift) - 1)))
+				break;
+		}
+
+	buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1);
+	buffer_list[0].addr &= ~0ull << shift;
+
+	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	npages = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift;
+
+	if (!npages)
+		return &mr->ibmr;
+
+	page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL);
+	if (!page_list) {
+		kfree(mr);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	n = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		for (j = 0;
+		     j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift;
+		     ++j)
+			page_list[n++] = buffer_list[i].addr + ((u64) j << shift);
+
+	mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) "
+		  "in PD %x; shift %d, npages %d.\n",
+		  (unsigned long long) buffer_list[0].addr,
+		  (unsigned long long) *iova_start,
+		  to_mpd(pd)->pd_num,
+		  shift, npages);
+
+	err = mthca_mr_alloc_phys(to_mdev(pd->device),
+				  to_mpd(pd)->pd_num,
+				  page_list, shift, npages,
+				  *iova_start, total_size,
+				  convert_access(acc), mr);
+
+	/* page_list is only needed to build the MPT; free it on
+	   both the success and the error path. */
+	kfree(page_list);
+
+	if (err) {
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+}
+
+static int mthca_dereg_mr(struct ib_mr *mr)
+{
+	mthca_free_mr(to_mdev(mr->device), to_mmr(mr));
+	kfree(mr);
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
+	return sprintf(buf, "%x\n", dev->rev_id);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
+	return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32),
+		       (int) (dev->fw_ver >> 16) & 0xffff,
+		       (int) dev->fw_ver & 0xffff);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
+	switch (dev->hca_type) {
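+	/*
+	 * Note: the strings below are the Mellanox part numbers --
+	 * Tavor is the PCI-X MT23108, Arbel the PCI Express MT25208,
+	 * which can also run in a Tavor compatibility mode.
+	 */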
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = dev->pdev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-11-21 21:25:54.898056524 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $
+ */
+
+#ifndef MTHCA_PROVIDER_H
+#define MTHCA_PROVIDER_H
+
+#include
+#include
+
+#define MTHCA_MPT_FLAG_ATOMIC        (1 << 14)
+#define MTHCA_MPT_FLAG_REMOTE_WRITE  (1 << 13)
+#define MTHCA_MPT_FLAG_REMOTE_READ   (1 << 12)
+#define MTHCA_MPT_FLAG_LOCAL_WRITE   (1 << 11)
+#define MTHCA_MPT_FLAG_LOCAL_READ    (1 << 10)
+
+struct mthca_buf_list {
+	void *buf;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+struct mthca_mr {
+	struct ib_mr ibmr;
+	int order;
+	u32 first_seg;
+};
+
+struct mthca_pd {
+	struct ib_pd ibpd;
+	u32 pd_num;
+	atomic_t sqp_count;
+	struct mthca_mr ntmr;
+};
+
+struct mthca_eq {
+	struct mthca_dev *dev;
+	int eqn;
+	u32 ecr_mask;
+	u16 msi_x_vector;
+	u16 msi_x_entry;
+	int have_irq;
+	int nent;
+	int cons_index;
+	struct mthca_buf_list *page_list;
+	struct mthca_mr mr;
+};
+
+struct mthca_av;
+
+struct mthca_ah {
+	struct ib_ah ibah;
+	int on_hca;
+	u32 key;
+	struct mthca_av *av;
+	dma_addr_t avdma;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table.  Each
+ * struct mthca_cq/qp also has its own lock.  An individual qp lock
+ * may be taken inside of an individual cq lock.  Both cqs attached to
+ * a qp may be locked, with the send cq locked first.  No other
+ * nesting should be done.
+ *
+ * Each struct mthca_cq/qp also has an atomic_t ref count.  The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference is also good for access through the consumer API, so
+ * modifying the CQ/QP etc. doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed.  Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
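+ *
+ * As a concrete illustration, the completion-event steps above are
+ * exactly what mthca_qp_event() in the mthca_qp.c hunk below does:
+ *
+ *	spin_lock(&dev->qp_table.lock);
+ *	qp = mthca_array_get(&dev->qp_table.qp,
+ *			     qpn & (dev->limits.num_qps - 1));
+ *	if (qp)
+ *		atomic_inc(&qp->refcount);
+ *	spin_unlock(&dev->qp_table.lock);
+ *
+ *	... look at the struct, deliver the event ...
+ *
+ *	if (atomic_dec_and_test(&qp->refcount))
+ *		wake_up(&qp->wait);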
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-11-21 21:25:54.927052223 -0800 @@ -0,0 +1,1485 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_qp.c 1270 2004-11-18 21:47:31Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + 
[UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; 
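+	/*
+	 * Note: the hardware keeps no pkey_index/Q_Key/PSN context for
+	 * the special (MLX transport) QPs, so the values cached here
+	 * are consumed later by build_mlx_header() when each send
+	 * WQE's UD header is built in software.
+	 */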
+ if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + 
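+	/*
+	 * Reading of the constants above: bits [31:29] of mtu_msgmax
+	 * hold the path MTU and bits [28:24] log2 of the maximum
+	 * message size, i.e. 2^11 = 2048 bytes for UD/MLX and 2^31
+	 * for the other transports.
+	 */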
qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, 
"modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + 
+ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + 
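+	/*
+	 * Teardown follows the destroy recipe in the locking comment
+	 * in mthca_provider.h: unhook the QP from qp_table, drop the
+	 * table's reference, then sleep until all completion-event
+	 * users have dropped theirs before freeing anything.
+	 */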
spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+			 cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0)   |
+			cpu_to_be32(1);
+		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
+		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+			((struct mthca_next_seg *) wqe)->imm = wr->imm_data;
+
+		wqe += sizeof (struct mthca_next_seg);
+		size = sizeof (struct mthca_next_seg) / 16;
+
+		if (qp->transport == UD) {
+			((struct mthca_ud_seg *) wqe)->lkey =
+				cpu_to_be32(to_mah(wr->wr.ud.ah)->key);
+			((struct mthca_ud_seg *) wqe)->av_addr =
+				cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma);
+			((struct mthca_ud_seg *) wqe)->dqpn =
+				cpu_to_be32(wr->wr.ud.remote_qpn);
+			((struct mthca_ud_seg *) wqe)->qkey =
+				cpu_to_be32(wr->wr.ud.remote_qkey);
+
+			wqe += sizeof (struct mthca_ud_seg);
+			size += sizeof (struct mthca_ud_seg) / 16;
+		} else if (qp->transport == MLX) {
+			err = build_mlx_header(dev, to_msqp(qp), ind, wr,
+					       wqe - sizeof (struct mthca_next_seg),
+					       wqe);
+			if (err) {
+				*bad_wr = wr;
+				goto out;
+			}
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		if (wr->num_sge > qp->sq.max_gs) {
+			mthca_err(dev, "too many gathers\n");
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		for (i = 0; i < wr->num_sge; ++i) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32(wr->sg_list[i].length);
+			((struct mthca_data_seg *) wqe)->lkey =
+				cpu_to_be32(wr->sg_list[i].lkey);
+			((struct mthca_data_seg *) wqe)->addr =
+				cpu_to_be64(wr->sg_list[i].addr);
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		/* Add one more inline data segment for ICRC */
+		if (qp->transport == MLX) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32((1 << 31) | 4);
+			((u32 *) wqe)[1] = 0;
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		qp->wrid[ind + qp->rq.max] = wr->wr_id;
+
+		if (wr->opcode >= ARRAY_SIZE(opcode)) {
+			mthca_err(dev, "opcode invalid\n");
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		if (prev_wqe) {
+			((struct mthca_next_seg *) prev_wqe)->nda_op =
+				cpu_to_be32(((ind << qp->sq.wqe_shift) +
+					     qp->send_wqe_offset) |
+					    opcode[wr->opcode]);
+			smp_wmb();
+			((struct mthca_next_seg *) prev_wqe)->ee_nds =
+				cpu_to_be32((size0 ?
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-11-21 21:25:54.952048515 -0800 @@ -0,0 +1,228 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * <http://openib.org/license.html>. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $ + */ + +#include <linux/config.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/pci.h> +#include <linux/delay.h> + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID.
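+ * (For Tavor the HCA appears at PCI device ID 0x5a44 and its bridge at 0x5a46.)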
*/ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
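+ * PCI_COMMAND (offset 0x4) is likewise skipped inside both restore loops and rewritten last of all below, so the device is re-enabled only after the rest of its configuration is back in place.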
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Mon Nov 22 07:13:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:54 -0800 Subject: [openib-general] [PATCH][RFC/v1][6/12] IPoIB IPv4 multicast In-Reply-To: <20041122713.cSeT4UFKGqJDdZ8T@topspin.com> Message-ID: <20041122713.Md0y3UqVvYcRT3Zf@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add <linux/if_infiniband.h> so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier Index: linux-bk/include/linux/if_infiniband.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-11-21 21:25:56.078881371 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * <http://openib.org/license.html>. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ Index: linux-bk/include/net/ip.h =================================================================== --- linux-bk.orig/include/net/ip.h 2004-11-21 21:07:12.110687532 -0800 +++ linux-bk/include/net/ip.h 2004-11-21 21:25:56.078881371 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver.
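+ * The 20-byte address is one reserved byte plus the 3-byte multicast QPN (0xffffff), followed by the 16-byte multicast GID carrying the scope bits, the IPv4 signature and the low 28 bits of the group address.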
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif Index: linux-bk/net/ipv4/arp.c =================================================================== --- linux-bk.orig/net/ipv4/arp.c 2004-11-21 21:07:24.904787535 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-21 21:25:56.079881223 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Nov 22 07:13:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:59 -0800 Subject: [openib-general] [PATCH][RFC/v1][7/12] IPoIB IPv6 support In-Reply-To: <20041122713.Md0y3UqVvYcRT3Zf@topspin.com> Message-ID: <20041122713.FnSlYodJYum7s82D@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-bk/include/net/if_inet6.h =================================================================== --- linux-bk.orig/include/net/if_inet6.h 2004-11-21 21:07:35.126269616 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-21 21:25:56.386835692 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-bk/net/ipv6/addrconf.c =================================================================== --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-21 21:07:29.222146392 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-21 21:25:56.387835544 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1098,6 +1099,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
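* (Each type admitted by the check above has a matching case in the EUI-64 generation switch earlier in this file.)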
*/ return; Index: linux-bk/net/ipv6/ndisc.c =================================================================== --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-21 21:07:06.642499599 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-21 21:25:56.388835395 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Nov 22 07:14:11 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:11 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support Message-ID: <20041122714.9zlcKGKvXlpga8EP@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support Date: Mon, 22 Nov 2004 07:14:11 -0800 Size: 23296 URL: From roland at topspin.com Mon Nov 22 07:14:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:04 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver Message-ID: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver Date: Mon, 22 Nov 2004 07:14:04 -0800 Size: 101204 URL: From roland at topspin.com Mon Nov 22 07:14:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:17 -0800 Subject: [openib-general] [PATCH][RFC/v1][10/12] Document InfiniBand ioctl use In-Reply-To: <20041122714.9zlcKGKvXlpga8EP@topspin.com> Message-ID: <20041122714.taTI3zcdWo5JfuMd@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/ioctl-number.txt =================================================================== --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-21 21:07:31.047875266 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-21 21:25:57.971600622 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Mon Nov 22 07:14:22 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:22 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122714.taTI3zcdWo5JfuMd@topspin.com> Message-ID: <20041122714.AyIOvRY195EGFTaO@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/infiniband/ipoib.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-21 21:25:58.205565918 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. 
It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<intf name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance. + +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html Index: linux-bk/Documentation/infiniband/sysfs.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-21 21:25:58.231562062 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port.
For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" Index: linux-bk/Documentation/infiniband/user_mad.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-21 21:25:58.258558058 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). 
The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Mon Nov 22 07:14:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:27 -0800 Subject: [openib-general] [PATCH][RFC/v1][12/12] InfiniBand MAINTAINERS entry In-Reply-To: <20041122714.AyIOvRY195EGFTaO@topspin.com> Message-ID: <20041122714.y3rav5uMdVVNMNlz@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier Index: linux-bk/MAINTAINERS =================================================================== --- linux-bk.orig/MAINTAINERS 2004-11-21 21:07:06.694491878 -0800 +++ linux-bk/MAINTAINERS 2004-11-21 21:25:58.537516680 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From tziporet at mellanox.co.il Mon Nov 22 07:22:26 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 22 Nov 2004 17:22:26 +0200 Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEAAD@mtlex01.yok.mtl.com> Congratulations for this important step toward the inclusion of Infiniband drivers in Linux kernel. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Monday, November 22, 2004 5:13 PM To: linux-kernel at vger.kernel.org Cc: openib-general at openib.org Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review I'm very happy to be able to post an initial version of InfiniBand patches for review. Although this code should be far closer to kernel coding standards than previous open source InfiniBand drivers, this initial posting should be treated as a request for comments and not a request for inclusion; our ultimate goal is to have these drivers included in the mainline kernel, but we expect that fixes and improvements will need to be made before the code is completely acceptable. These patches add a minimal but complete level of InfiniBand support, including an IB midlayer, a low-level driver for Mellanox HCAs, an IP-over-InfiniBand driver, and a mechanism for MADs (management datagrams) to be passed to and from userspace. This means that these patches are all that is required for the kernel to bring up and use an IP-over-InfiniBand link. (The OpenSM subnet manager has not been ported to this kernel API yet, although this work is underway. 
This means that at the moment, a kernel with these patches cannot be used to bring up a fabric; however, the kernel side is complete) The code has not been through extreme stress testing yet, but it has been used successfully on i386, x86_64, ppc64, ia64 and sparc64 systems, including mixed 32/64 systems. Feedback on both details of the code as well as the high-level organization of the code will be very much appreciated. For example, the current set of patches puts include files in driver/infiniband/include; would it be preferred to put include files in include/linux/infiniband/, directly in include/linux, or perhaps in include/infiniband? We would also like to explore the best avenue for having these patches merged. It may be desirable for the patches to spend some time in -mm before moving into Linus's kernel; on the other hand, the patches make only very minimal and safe changes outside of drivers/infiniband, so it is quite reasonable to merge them directly into the mainline kernel. Although 2.6.10 is now closed, 2.6.11 will probably be open by the time the review process is complete. We look forward to the community's comments and criticisms! Thanks, Roland Dreier OpenIB Alliance www.openib.org _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From hch at infradead.org Mon Nov 22 07:31:44 2004 From: hch at infradead.org (Christoph Hellwig) Date: Mon, 22 Nov 2004 15:31:44 +0000 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122714.AyIOvRY195EGFTaO@topspin.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> Message-ID: <20041122153144.GA4821@infradead.org> > + When the IPoIB driver is loaded, it creates one interface for each > + port using the P_Key at index 0. To create an interface with a > + different P_Key, write the desired P_Key into the main interface's > + /sys/class/net//create_child file. For example: > + > + echo 0x8001 > /sys/class/net/ib0/create_child > + > + This will create an interface named ib0.8001 with P_Key 0x8001. To > + remove a subinterface, use the "delete_child" file: > + > + echo 0x8001 > /sys/class/net/ib0/delete_child > + > + The P_Key for any interface is given by the "pkey" file, and the > + main interface for a subinterface is in "parent." Any reason this doesn't use an interface similar to the normal vlan code? And what is a P_Key? From roland at topspin.com Mon Nov 22 07:41:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:41:47 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122153144.GA4821@infradead.org> (Christoph Hellwig's message of "Mon, 22 Nov 2004 15:31:44 +0000") References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122153144.GA4821@infradead.org> Message-ID: <52k6sdevr8.fsf@topspin.com> Christoph> Any reason this doesn't use an interface similar to the Christoph> normal vlan code? The normal vlan code uses an ioctl(). I thought a simple sysfs interface would be more palatable than a new socket ioctl. Christoph> And what is a P_Key? It is a 16-bit identifier carried by IB packets that says which partition the packet is in. 
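(The high-order bit of a P_Key marks full versus limited membership, so 0x8001 and 0x0001 name the same partition with different membership types.)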
End ports have P_Key tables that list which partitions they are members of (a port can be a member of one or more partitions, and can only receive packets from that partition). - Roland From roland at topspin.com Mon Nov 22 10:31:10 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 10:31:10 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support In-Reply-To: <20041122713.TMt4584EVSreQOO2@topspin.com> (Roland Dreier's message of "Mon, 22 Nov 2004 07:13:29 -0800") References: <20041122713.TMt4584EVSreQOO2@topspin.com> Message-ID: <528y8tenwx.fsf@topspin.com> Seems like spamassassin is still overzealous. I'm confused by the tests that applied to this message: > 2.4 RATWARE_HASH_2_V2 Bulk email fingerprint (hash 2 v2) found > 1.2 RATWARE_HASH_2 Bulk email fingerprint (hash 2) found I did some digging on this. This SA rules seem pretty bogus -- they just look at if the X-Mailer line has at least 14 (resp. 16) characters from [A-Za-z0-9_]. My patch script sets X-Mailer to "roland_patchbomb", which is 16 characters long, so all my patch mail gets a log factor of 3.6 right off the bat. I'll work around this by changing my X-Mailer... > 1.8 DOMAIN_BODY BODY: Domain registration spam body This test thinks a kernel patch looks like domain registration spam ?? > 1.1 REMOVE_REMOVAL_NEAR List removal information And a log factor of 1.1 for list removal information, which is added to every message by mailman... - R. From iod00d at hp.com Mon Nov 22 10:47:56 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 22 Nov 2004 10:47:56 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support In-Reply-To: <528y8tenwx.fsf@topspin.com> References: <20041122713.TMt4584EVSreQOO2@topspin.com> <528y8tenwx.fsf@topspin.com> Message-ID: <20041122184756.GG4100@esmail.cup.hp.com> On Mon, Nov 22, 2004 at 10:31:10AM -0800, Roland Dreier wrote: > Seems like spamassassin is still overzealous. I'm confused by the > tests that applied to this message: For my personal use, I added overrides for how certain tests score to ~/.spamassassin/user_prefs. I'm sure something similar exists for mailing list use. e.g.: ... score MIME_HTML_ONLY 5.00 score MIME_HTML_NO_CHARSET 2.00 score MICROSOFT_EXECUTABLE 3.50 score HTML_FONT_INVISIBLE 5.00 ... The four tests that you pointed out could just have their "hit" score lowered so they add less (zero?) to the total score. Hacking around, it looks like adding similar lines to /etc/spamassissin/local.cf would change the "default" test scores system wide. (at least on my debian machine) grant From sam at ravnborg.org Mon Nov 22 11:33:50 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Mon, 22 Nov 2004 20:33:50 +0100 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> Message-ID: <20041122193350.GB8150@mars.ravnborg.org> Nitpicking. Sam > --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-21 21:25:53.101323036 -0800 > +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-21 21:25:53.879207651 -0800 > @@ -2,7 +2,8 @@ > > obj-$(CONFIG_INFINIBAND) += \ > ib_core.o \ > - ib_mad.o > + ib_mad.o \ > + ib_sa.o It's more readable to keep .o files on one line. 
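For example, "obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o" can be read at a glance.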
> > ib_core-objs := \ > packer.o \ For new stuff please use ib_core-y := > @@ -17,3 +18,5 @@ > mad.o \ > smi.o \ > agent.o > + > +ib_sa-objs := sa_query.o ib_sa-y := please. > +#include > + > +#include <ib_pack.h> > +#include <ib_sa.h> If they are in same dir as .c file use: #include "ib_pack.h" #include "ib_sa.h" > Index: linux-bk/drivers/infiniband/include/ib_sa.h .h files for a subsystem like this ought to be placed in include/infiniband if they will be used by files in other directories than drivers/infiniband From sam at ravnborg.org Mon Nov 22 11:40:03 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Mon, 22 Nov 2004 20:40:03 +0100 Subject: [openib-general] Re: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> References: <20041122713.FnSlYodJYum7s82D@topspin.com> <20041122714.nKCPmH9LMhT0X7WE@topspin.com> Message-ID: <20041122194003.GC8150@mars.ravnborg.org> More nitpicking.. Sam > +++ linux-bk/drivers/infiniband/Makefile 2004-11-21 21:25:56.794775182 -0800 > @@ -1,2 +1,3 @@ > obj-$(CONFIG_INFINIBAND) += core/ No reason to use $(CONFIG_INFINIBAND) here - it's already done in drivers/Makefile > +EXTRA_CFLAGS += -Idrivers/infiniband/include This will get killed if you move the include files... + > +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o > + > +ib_ipoib-y := ipoib_main.o \ > + ipoib_ib.o \ > + ipoib_multicast.o \ > + ipoib_verbs.o \ > + ipoib_vlan.o One or two lines. > +#include > + > +#include "ipoib_proto.h" Should be included as the last file - since it's the most local one. > + > +#include > +#include > +#include > From roland at topspin.com Mon Nov 22 13:28:56 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 13:28:56 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122193350.GB8150@mars.ravnborg.org> (Sam Ravnborg's message of "Mon, 22 Nov 2004 20:33:50 +0100") References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122193350.GB8150@mars.ravnborg.org> Message-ID: <521xeld147.fsf@topspin.com> Sam> Nitpicking. Great, thanks for the help :) I'll fix these up before the next version of the patches is posted. Sam> It's more readable to keep .o files on one line. OK, I will reformat our Makefiles. (I used the old style because it's easier to add/remove source files, but I think you're right that it's better to optimize for readability rather than the rare event of adding/removing sources) Sam> For new stuff please use ib_core-y := OK, no problem (until a few days ago I didn't even know -y was equivalent to -objs, let alone preferred). Sam> .h files for a subsystem like this ought to be placed in Sam> include/infiniband if they will be used by files in other Sam> directories than drivers/infiniband Right now all the code is in drivers/infiniband. However Christoph suggested moving the .h files to include/infiniband as well. I have no problem moving the includes (and as you point out this eliminates having to add a -I to our CFLAGS), but on the other hand do we want to add a new toplevel include directory for what is still admittedly a minor subsystem?
Thanks, Roland From greg at kroah.com Mon Nov 22 14:25:07 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:25:07 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> Message-ID: <20041122222507.GB15634@kroah.com> On Mon, Nov 22, 2004 at 07:13:48AM -0800, Roland Dreier wrote: > > Index: linux-bk/drivers/infiniband/core/Makefile > =================================================================== Please hack your submit script to not add these headers, when importing to bk they end up showing up in the change log comments :( > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-11-21 21:25:53.928200384 -0800 > @@ -0,0 +1,815 @@ > +/* > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available at > + * , or the OpenIB.org BSD > + * license, available in the LICENSE.TXT file accompanying this > + * software. These details are also available at > + * . > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Copyright (c) 2004 Topspin Communications. All rights reserved. No email address of who to bug with issues? > + * > + * $Id$ Not needed :) > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > + > +MODULE_AUTHOR("Roland Dreier"); > +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); > +MODULE_LICENSE("Dual BSD/GPL"); > + > +struct ib_sa_hdr { > + u64 sm_key; > + u16 attr_offset; > + u16 reserved; > + ib_sa_comp_mask comp_mask; > +} __attribute__ ((packed)); Why is this packed? > +struct ib_sa_mad { > + struct ib_mad_hdr mad_hdr; > + struct ib_rmpp_hdr rmpp_hdr; > + struct ib_sa_hdr sa_hdr; > + u8 data[200]; > +} __attribute__ ((packed)); Same here? 
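For reference, the change is mechanical. A minimal sketch of the pattern (hypothetical buf and size variables; mthca reaches the generic device through the PCI device it already keeps in mdev->pdev):

    /* Before: PCI-specific mapping, tied to struct pci_dev */
    dma_addr_t mapping = pci_map_single(mdev->pdev, buf, size,
                                        PCI_DMA_TODEVICE);
    pci_unmap_single(mdev->pdev, mapping, size, PCI_DMA_TODEVICE);

    /* After: generic DMA API, works with any bus's struct device */
    dma_addr_t mapping = dma_map_single(&mdev->pdev->dev, buf, size,
                                        DMA_TO_DEVICE);
    dma_unmap_single(&mdev->pdev->dev, mapping, size, DMA_TO_DEVICE);

The pci_unmap_addr()/pci_unmap_addr_set() bookkeeping is the one piece left on the PCI helpers, since the generic API has no equivalent yet.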
> + > +struct ib_sa_sm_ah { > + struct ib_ah *ah; > + struct kref ref; > +}; > + > +struct ib_sa_port { > + struct ib_mad_agent *agent; > + struct ib_mr *mr; > + struct ib_sa_sm_ah *sm_ah; > + struct work_struct update_task; > + spinlock_t ah_lock; > + u8 port_num; > +}; > + > +struct ib_sa_device { > + int start_port, end_port; > + struct ib_event_handler event_handler; > + struct ib_sa_port port[0]; > +}; > + > +struct ib_sa_query { > + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); > + void (*release)(struct ib_sa_query *); > + struct ib_sa_port *port; > + struct ib_sa_mad *mad; > + struct ib_sa_sm_ah *sm_ah; > + DECLARE_PCI_UNMAP_ADDR(mapping) > + int id; > +}; > + > +struct ib_sa_path_query { > + void (*callback)(int, struct ib_sa_path_rec *, void *); > + void *context; > + struct ib_sa_query sa_query; > +}; > + > +struct ib_sa_mcmember_query { > + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); > + void *context; > + struct ib_sa_query sa_query; > +}; > + > +static void ib_sa_add_one(struct ib_device *device); > +static void ib_sa_remove_one(struct ib_device *device); > + > +static struct ib_client sa_client = { > + .name = "sa", > + .add = ib_sa_add_one, > + .remove = ib_sa_remove_one > +}; > + > +static spinlock_t idr_lock; > +DEFINE_IDR(query_idr); Should this be global or static? > + > +static spinlock_t tid_lock; > +static u32 tid; > + > +enum { > + IB_SA_ATTR_CLASS_PORTINFO = 0x01, > + IB_SA_ATTR_NOTICE = 0x02, > + IB_SA_ATTR_INFORM_INFO = 0x03, > + IB_SA_ATTR_NODE_REC = 0x11, > + IB_SA_ATTR_PORT_INFO_REC = 0x12, > + IB_SA_ATTR_SL2VL_REC = 0x13, > + IB_SA_ATTR_SWITCH_REC = 0x14, > + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, > + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, > + IB_SA_ATTR_MCAST_FDB_REC = 0x17, > + IB_SA_ATTR_SM_INFO_REC = 0x18, > + IB_SA_ATTR_LINK_REC = 0x20, > + IB_SA_ATTR_GUID_INFO_REC = 0x30, > + IB_SA_ATTR_SERVICE_REC = 0x31, > + IB_SA_ATTR_PARTITION_REC = 0x33, > + IB_SA_ATTR_RANGE_REC = 0x34, > + IB_SA_ATTR_PATH_REC = 0x35, > + IB_SA_ATTR_VL_ARB_REC = 0x36, > + IB_SA_ATTR_MC_GROUP_REC = 0x37, > + IB_SA_ATTR_MC_MEMBER_REC = 0x38, > + IB_SA_ATTR_TRACE_REC = 0x39, > + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, > + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b > +}; Oops, tabs vs. spaces. Care to use the __bitwise field here so that you can have sparse check to see that you are actually using the proper enum values in all places? See the kobject_action code for an example of this. > + > +#define PATH_REC_FIELD(field) \ > + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ > + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ > + .field_name = "sa_path_rec:" #field > + > +static const struct ib_field path_rec_table[] = { > + { RESERVED, > + .offset_words = 0, > + .offset_bits = 0, > + .size_bits = 32 }, What is "RESERVED"? I must be missing a previous patch somewhere, I currently don't see all of the series yet. thanks, greg k-h From greg at kroah.com Mon Nov 22 14:13:04 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:13:04 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> Message-ID: <20041122221304.GA15634@kroah.com> On Mon, Nov 22, 2004 at 07:13:24AM -0800, Roland Dreier wrote: > organization of the code will be very much appreciated. 
For example, > the current set of patches puts include files in driver/infiniband/include; > would it be preferred to put include files in include/linux/infiniband/, > directly in include/linux, or perhaps in include/infiniband? Who would be including these files, only drivers in drivers/infiniband? Or from files in other parts of the kernel? If from other parts of the kernel, use include/linux/infiniband. thanks, greg k-h From greg at kroah.com Mon Nov 22 14:34:32 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:34:32 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> References: <20041122713.FnSlYodJYum7s82D@topspin.com> <20041122714.nKCPmH9LMhT0X7WE@topspin.com> Message-ID: <20041122223432.GC15634@kroah.com> On Mon, Nov 22, 2004 at 07:14:04AM -0800, Roland Dreier wrote: > > +#define ipoib_printk(level, priv, format, arg...) \ > + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) > +#define ipoib_warn(priv, format, arg...) \ > + ipoib_printk(KERN_WARNING, priv, format , ## arg) What's wrong with using the dev_printk() and friends instead of your own? And why cast a pointer in a macro, don't you know the type of it anyway? > Index: linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-11-21 21:25:56.924755902 -0800 You're using a separate filesystem to export debug data? I'm all for new virtual filesystems, but why not just use sysfs for this? What are you doing in here that you can't do with another mechanism (netlink, sysfs, sockets, relayfs, etc.)? > +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > +#define DATA_PATH_DEBUG_HELP " and data path tracing if > 1" > +#else > +#define DATA_PATH_DEBUG_HELP "" > +#endif > + > +module_param(debug_level, int, 0644); > +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" DATA_PATH_DEBUG_HELP); Why not just use 2 different debug variables for this? > + > +int mcast_debug_level; Global? thanks, greg k-h From ftillier at infiniconsys.com Mon Nov 22 14:40:45 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 22 Nov 2004 14:40:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA(Subnet Administration) query support In-Reply-To: <20041122222507.GB15634@kroah.com> Message-ID: <000401c4d0e4$4edb43d0$655aa8c0@infiniconsys.com> > From: Greg KH [mailto:greg at kroah.com] > Sent: Monday, November 22, 2004 2:25 PM > > > +struct ib_sa_hdr { > > + u64 sm_key; > > + u16 attr_offset; > > + u16 reserved; > > + ib_sa_comp_mask comp_mask; > > +} __attribute__ ((packed)); > > Why is this packed? > > > +struct ib_sa_mad { > > + struct ib_mad_hdr mad_hdr; > > + struct ib_rmpp_hdr rmpp_hdr; > > + struct ib_sa_hdr sa_hdr; > > + u8 data[200]; > > +} __attribute__ ((packed)); > > Same here? These describe on-the-wire IB structures, and their definition matches the IB spec (Version 1.1, Volume 1) struct ib_mad_hdr matches "Standard MAD Header", Figure 144 struct ib_rmpp_hdr matches "RMPP MAD Header", Figure 168 struct ib_sa_hdr and struct ib_sa_mad match "SA Header", Figure 193 Hope that answers your question - let us know if it doesn't. 
Cheers, - Fab From roland at topspin.com Mon Nov 22 14:50:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 14:50:41 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <20041122221304.GA15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:13:04 -0800") References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> <20041122221304.GA15634@kroah.com> Message-ID: <52wtwdbiri.fsf@topspin.com> Greg> Who would be including these files, only drivers in Greg> drivers/infiniband? Or from files in other parts of the Greg> kernel? In the current patchset all the code is under drivers/infiniband. Greg> If from other parts of the kernel, use include/linux/infiniband. That's one vote for include/linux/infiniband and two votes for include/infiniband so far... - R. From greg at kroah.com Mon Nov 22 14:50:33 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:50:33 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041122714.9zlcKGKvXlpga8EP@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> Message-ID: <20041122225033.GD15634@kroah.com> On Mon, Nov 22, 2004 at 07:14:11AM -0800, Roland Dreier wrote: > Add a driver that provides a character special device for each > InfiniBand port. This device allows userspace to send and receive > MADs via write() and read() (with some control operations implemented > as ioctls). Do you really need these ioctls? For example: > +static int ib_umad_ioctl(struct inode *inode, struct file *filp, > + unsigned int cmd, unsigned long arg) > +{ > + switch (cmd) { > + case IB_USER_MAD_GET_ABI_VERSION: > + return put_user(IB_USER_MAD_ABI_VERSION, > + (u32 __user *) arg) ? -EFAULT : 0; This could be in a sysfs file, right? > + case IB_USER_MAD_REGISTER_AGENT: > + return ib_umad_reg_agent(filp->private_data, arg); > + case IB_USER_MAD_UNREGISTER_AGENT: > + return ib_umad_unreg_agent(filp->private_data, arg); You are letting any user, with any privilege register or unregister an "agent"? And shouldn't you lock your list of agent ids when adding or removing one, or are you relying on the BKL of the ioctl call? If so, please document this. Also, these "agents" seem to be a type of filter, right? Is there no other way to implement this than an ioctl? thanks, greg k-h From greg at kroah.com Mon Nov 22 14:53:35 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:53:35 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122714.AyIOvRY195EGFTaO@topspin.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> Message-ID: <20041122225335.GE15634@kroah.com> On Mon, Nov 22, 2004 at 07:14:22AM -0800, Roland Dreier wrote: > +/dev files > + > + To create the appropriate character device files automatically with > + udev, a rule like > + > + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" > + > + can be used. This will create a device node named > + > + /dev/infiniband/mthca0/ports/1/mad > + > + for port 1 of device mthca0, and so on. Why do you propose such a "deep" nesting of directories for umad devices? That's not the LANNANA way. Oh, have you asked for a real major number to be reserved for umad? 
thanks, greg k-h From roland at topspin.com Mon Nov 22 14:58:45 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 14:58:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122225335.GE15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:53:35 -0800") References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> Message-ID: <52sm71bie2.fsf@topspin.com> Greg> Why do you propose such a "deep" nesting of directories for Greg> umad devices? That's not the LANNANA way. No real reason, I'm open to better suggestions. Greg> Oh, have you asked for a real major number to be reserved Greg> for umad? No, I think we're fine with a dynamic major. Is there any reason to want a real major? - Roland From roland at topspin.com Mon Nov 22 15:05:40 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:05:40 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041122225033.GD15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:50:33 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> Message-ID: <52oehpbi2j.fsf@topspin.com> >> Add a driver that provides a character special device for each >> InfiniBand port. This device allows userspace to send and >> receive MADs via write() and read() (with some control >> operations implemented as ioctls). Greg> Do you really need these ioctls? Greg> This could be in a sysfs file, right? The API version definitely can be, good point. Greg> You are letting any user, with any privilege register or Greg> unregister an "agent"? They have to be able to open the device node. We could add a check that they have it open for writing but there's not really much point in opening this device read-only. Greg> And shouldn't you lock your list of agent ids when adding or Greg> removing one, or are you relying on the BKL of the ioctl Greg> call? If so, please document this. Each file has an "agent_mutex" rwsem that protects this... the global list of agents handled by the lower level API is protected by its own locking. Greg> Also, these "agents" seem to be a type of filter, right? Is Greg> there no other way to implement this than an ioctl? ioctl seems to be the least bad way to me. This really feels like a legitimate use of ioctl to me -- we use read/write to handle passing data through our file descriptor, and ioctl for control of the properties of the descriptor. What would you suggest as an ioctl replacement? Thanks, Roland From greg at kroah.com Mon Nov 22 15:05:33 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 15:05:33 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <52sm71bie2.fsf@topspin.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> Message-ID: <20041122230533.GB13083@kroah.com> On Mon, Nov 22, 2004 at 02:58:45PM -0800, Roland Dreier wrote: > Greg> Why do you propose such a "deep" nesting of directories for > Greg> umad devices? That's not the LANNANA way. > > No real reason, I'm open to better suggestions. /dev/umad* /dev/ib/umad* > Greg> Oh, have you asked for a real major number to be reserved > Greg> for umad? 
> > No, I think we're fine with a dynamic major. Is there any reason to > want a real major? People who do not use udev will not like you. thanks, greg k-h From greg at kroah.com Mon Nov 22 15:01:28 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 15:01:28 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <52wtwdbiri.fsf@topspin.com> References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> <20041122221304.GA15634@kroah.com> <52wtwdbiri.fsf@topspin.com> Message-ID: <20041122230128.GA13083@kroah.com> On Mon, Nov 22, 2004 at 02:50:41PM -0800, Roland Dreier wrote: > Greg> Who would be including these files, only drivers in > Greg> drivers/infiniband? Or from files in other parts of the > Greg> kernel? > > In the current patchset all the code is under drivers/infiniband. Then it should just stay in that directory. Well, that's my preference anyway :) thanks, greg k-h From roland at topspin.com Mon Nov 22 15:18:07 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:18:07 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041122223432.GC15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:34:32 -0800") References: <20041122713.FnSlYodJYum7s82D@topspin.com> <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122223432.GC15634@kroah.com> Message-ID: <52k6sdbhhs.fsf@topspin.com> Greg> What's wrong with using the dev_printk() and friends instead Greg> of your own? dev_printk expects a struct device, not a net_device. Greg> And why cast a pointer in a macro, don't you know the type Greg> of it anyway? this lets us pass in the return value of netdev_priv() directly without having to have the cast in the code that uses the macro. Greg> You're using a separate filesystem to export debug data? Greg> I'm all for new virtual filesystems, but why not just use Greg> sysfs for this? What are you doing in here that you can't Greg> do with another mechanism (netlink, sysfs, sockets, relayfs, Greg> etc.)? For each multicast group, we want to export the GID, how long it's been around, whether our join has completed and whether it's send-only. It wouldn't be too bad to create a kobject with all those attributes but getting the info from so many little files is a little bit of a pain, and so is dealing with kobject lifetime rules. It's even worse with netlink since then a new tool is required. (AFAIK relayfs isn't in Linus's kernel). It's nice to be able to tell someone to just mount ipoib_debugfs and send the contents of debugfs/ib0_mcg. The actual filesystem stuff is pretty trivial using everything libfs provides for us now... Greg> Why not just use 2 different debug variables for this? No real reason... I'll fix it up. >> + +int mcast_debug_level; Greg> Global? Good point, I'll move it into ipoib_multicast.c. - R. 
From roland at topspin.com Mon Nov 22 15:21:26 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:21:26 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122230533.GB13083@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 15:05:33 -0800") References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> Message-ID: <52fz31bhc9.fsf@topspin.com> Greg> /dev/umad* /dev/ib/umad* Right now the umad module creates devices with kernel names like umad0, umad1, etc, but it puts ibdev and port files in sysfs so userspace can figure out which IB device and port the file corresponds to. I would really prefer to have this info reflected in the /dev name... Greg> People who do not use udev will not like you. OK, I guess we will apply to LANANA. - R. From johannes at erdfelt.com Mon Nov 22 15:30:47 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Mon, 22 Nov 2004 15:30:47 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122230533.GB13083@kroah.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> Message-ID: <20041122233047.GH27658@sventech.com> On Mon, Nov 22, 2004, Greg KH wrote: > On Mon, Nov 22, 2004 at 02:58:45PM -0800, Roland Dreier wrote: > > Greg> Oh, have you asked for a real major number to be reserved > > Greg> for umad? > > > > No, I think we're fine with a dynamic major. Is there any reason to > > want a real major? > > People who do not use udev will not like you. I don't quite understand this. Given things like udev, wouldn't dynamic majors work just like having a static major number? JE From roland at topspin.com Mon Nov 22 15:34:23 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:34:23 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122222507.GB15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:25:07 -0800") References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> Message-ID: <527jodbgqo.fsf@topspin.com> Greg> Please hack your submit script to not add these headers, Greg> when importing to bk they end up showing up in the change Greg> log comments :( OK, will do. Greg> No email address of who to bug with issues? There's a patch to MAINTAINERS... Greg> Why is this packed? Greg> Same here? Both of these structures unfortunately have 64 bit fields only aligned to 32 bits (and are sent on the wire so we can't fiddle with the layout). So without the "packed" they won't come out right on 64-bit archs. Greg> Should this be global or static? static, fixed. Greg> Oops, tabs vs. spaces. fixed. Greg> Care to use the __bitwise field here so that you can have Greg> sparse check to see that you are actually using the proper Greg> enum values in all places? See the kobject_action code for Greg> an example of this. Sure, that's a good idea. I'll look for other places we can do this too. Greg> What is "RESERVED"? I must be missing a previous patch Greg> somewhere, I currently don't see all of the series yet. 
From jjengla at sandia.gov Mon Nov 22 17:26:04 2004
From: jjengla at sandia.gov (Josh England)
Date: Mon, 22 Nov 2004 17:26:04 -0800
Subject: [openib-general] troubles with IPoIB
Message-ID: <1101173164.18604.53.camel@localhost>

Hi all,

I've got an 85-node x86_64 PCIe cluster I'd like to run (and test)
openIB on. I've built a kernel using the latest patches from SVN,
loaded all the modules, and I see ACTIVE on the ports, but IPoIB does
not seem to want to work. After I 'echo 2 > module/ib_ipoib/debug_level',
I get the following:

Nov 22 16:47:03 n0 kernel: ib0: called: id 14, op 0, status: 0
Nov 22 16:47:03 n0 kernel: ib0: send complete, wrid 14
Nov 22 16:47:03 n0 kernel: ib0: called: id -2147483634, op 128, status: 0
Nov 22 16:47:03 n0 kernel: ib0: received 100 bytes, SLID 0x000d
Nov 22 16:47:03 n0 kernel: ib0: dropping loopback packet
Nov 22 16:47:04 n0 kernel: ib0: sending packet, length=60 address=000001007fd17340 qpn=0xffffff

Is the 'called: id [big-fat-negative-number]' supposed to be there?
Is there a utility (such as vping) to test even basic IB connectivity?
How can I get IPoIB to work?

-----------------------------------------------
Josh England
Sandia National Laboratory, Livermore, CA
Visualization and Scientific Computing
email: jjengla at sandia.gov
phone: (925) 294-2076

-JE

From roland at topspin.com Mon Nov 22 17:34:35 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 17:34:35 -0800
Subject: [openib-general] troubles with IPoIB
In-Reply-To: <1101173164.18604.53.camel@localhost> (Josh England's message of "Mon, 22 Nov 2004 17:26:04 -0800")
References: <1101173164.18604.53.camel@localhost>
Message-ID: <52vfbx9wlw.fsf@topspin.com>

    Josh> Is the 'called: id [big-fat-negative-number]' supposed to be
    Josh> there?

Yes, that's fine (it's just an artifact of the fact that receive work
request IDs get (1<<31) ORed in). I should clean up that debug
message though.

The debug messages show IPoIB apparently sending an ARP packet and
seeing it appear on the broadcast group (looped back locally).

    Josh> Is there a utility (such as vping) to test even
    Josh> basic IB connectivity?

Unfortunately not yet.

    Josh> How can I get IPoIB to work?

What subnet manager are you using? If you mount ipoib_debugfs
somewhere (say /ipoib_debugfs), what do you get from

    cat /ipoib_debugfs/ib0_mcg

Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0"
show anything on the other nodes when you try to ping or something?

Thanks,
  Roland

From roland at topspin.com Mon Nov 22 18:08:21 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 18:08:21 -0800
Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support
In-Reply-To: <20041122225033.GD15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:50:33 -0800")
References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com>
Message-ID: <52ekil9v1m.fsf@topspin.com>

    Greg> This could be in a sysfs file, right?

Ugh, how does one add an attribute (like the ABI version) to a
class_simple? It shouldn't be per-device but I don't see anything
like class_create_file() that could work for class_simple.

Thanks,
  Roland
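As the thread works out below, one way to get a class-wide attribute
is to manage the class directly instead of using class_simple. A
rough sketch against the 2.6-era driver core -- the class name
mirrors the infiniband_mad class mentioned later in the thread, and
the version value is a placeholder, not the real ABI number:

    #include <linux/kernel.h>
    #include <linux/device.h>

    static ssize_t show_abi_version(struct class *cls, char *buf)
    {
            return sprintf(buf, "%d\n", 1);  /* placeholder version */
    }
    static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);

    static struct class umad_class = {
            .name = "infiniband_mad",
    };

    static int __init umad_class_init(void)
    {
            int ret = class_register(&umad_class);
            if (ret)
                    return ret;
            /* class-wide file: /sys/class/infiniband_mad/abi_version */
            ret = class_create_file(&umad_class, &class_attr_abi_version);
            if (ret)
                    class_unregister(&umad_class);
            return ret;
    }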
From jjengla at sandia.gov Mon Nov 22 18:14:23 2004
From: jjengla at sandia.gov (Josh England)
Date: Mon, 22 Nov 2004 18:14:23 -0800
Subject: [openib-general] troubles with IPoIB
In-Reply-To: <52vfbx9wlw.fsf@topspin.com>
References: <1101173164.18604.53.camel@localhost> <52vfbx9wlw.fsf@topspin.com>
Message-ID: <1101176063.17750.58.camel@localhost>

On Mon, 2004-11-22 at 17:34 -0800, Roland Dreier wrote:
> What subnet manager are you using?

Embedded Voltaire SM.

> If you mount ipoib_debugfs
> somewhere (say /ipoib_debugfs), what do you get from
>     cat /ipoib_debugfs/ib0_mcg

How do I mount ipoib_debugfs? Is there some device associated with it
that I should be seeing? BTW...I'm using udev.

> Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0"
> show anything on the other nodes when you try to ping or something?

Nope...nothing's coming in...

-JE

From roland at topspin.com Mon Nov 22 18:27:36 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 18:27:36 -0800
Subject: [openib-general] troubles with IPoIB
In-Reply-To: <1101176063.17750.58.camel@localhost> (Josh England's message of "Mon, 22 Nov 2004 18:14:23 -0800")
References: <1101173164.18604.53.camel@localhost> <52vfbx9wlw.fsf@topspin.com> <1101176063.17750.58.camel@localhost>
Message-ID: <521xel9u5j.fsf@topspin.com>

    Josh> How do I mount ipoib_debugfs? Is there some device
    Josh> associated with it that I should be seeing? BTW...I'm using
    Josh> udev.

No device associated... just do

    mount -t ipoib_debugfs none /ipoib_debugfs/

for whatever value of /ipoib_debugfs/ you like. (You need to create
the directory first, just like any other mount point.)

udev should be no problem (all my dev systems use it too).

 - Roland

From roland at topspin.com Mon Nov 22 19:33:21 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 19:33:21 -0800
Subject: [openib-general] [PATCH] Convert from pci_xxx to dma_xxx functions
Message-ID: <52wtwd8cji.fsf@topspin.com>

Christoph Hellwig suggested we might as well put in a generic struct
device *dma_device and use the generic dma_map functions rather than
assuming we're dealing with a PCI device. (There's no dma_xxx
equivalent of pci_unmap_addr_set() and friends, so I left that
stuff -- Christoph agrees this is OK for now.)

Look OK to commit?
Thanks, Roland Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1273) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -773,7 +773,7 @@ if (!priv) goto alloc_mem_failed; - SET_NETDEV_DEV(priv->dev, &hca->dma_device->dev); + SET_NETDEV_DEV(priv->dev, hca->dma_device); result = ib_query_pkey(hca, port, 0, &priv->pkey); if (result) { Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1272) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -107,9 +107,9 @@ } skb_reserve(skb, 4); /* 16 byte align IP header */ priv->rx_ring[id].skb = skb; - addr = pci_map_single(priv->ca->dma_device, + addr = dma_map_single(priv->ca->dma_device, skb->data, IPOIB_BUF_SIZE, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); ret = ipoib_ib_receive(priv, id, addr); @@ -154,11 +154,11 @@ priv->rx_ring[wr_id].skb = NULL; - pci_unmap_single(priv->ca->dma_device, + dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(&priv->rx_ring[wr_id], mapping), IPOIB_BUF_SIZE, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); if (wc->status != IB_WC_SUCCESS) { if (wc->status != IB_WC_WR_FLUSH_ERR) @@ -216,10 +216,10 @@ tx_req = &priv->tx_ring[wr_id]; - pci_unmap_single(priv->ca->dma_device, + dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(tx_req, mapping), tx_req->skb->len, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ++priv->stats.tx_packets; priv->stats.tx_bytes += tx_req->skb->len; @@ -318,9 +318,9 @@ */ tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; tx_req->skb = skb; - addr = pci_map_single(priv->ca->dma_device, + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); pci_unmap_addr_set(tx_req, mapping, addr); if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1272) +++ infiniband/include/ib_verbs.h (working copy) @@ -679,7 +679,7 @@ }; struct ib_device { - struct pci_dev *dma_device; + struct device *dma_device; char name[IB_DEVICE_NAME_MAX]; Index: infiniband/core/agent.c =================================================================== --- infiniband/core/agent.c (revision 1272) +++ infiniband/core/agent.c (working copy) @@ -23,11 +23,15 @@ Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
*/ +#include + +#include + #include + #include "smi.h" #include "agent_priv.h" #include "mad_priv.h" -#include spinlock_t ib_agent_port_list_lock; @@ -117,10 +121,10 @@ agent_send_wr->mad = mad; /* PCI mapping */ - gather_list.addr = pci_map_single(mad_agent->device->dma_device, + gather_list.addr = dma_map_single(mad_agent->device->dma_device, &mad->mad, sizeof(mad->mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof(mad->mad); gather_list.lkey = (*port_priv->mr).lkey; @@ -182,10 +186,10 @@ spin_lock_irqsave(&port_priv->send_list_lock, flags); if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - pci_unmap_single(mad_agent->device->dma_device, + dma_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), sizeof(mad->mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); } else { @@ -255,10 +259,10 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ - pci_unmap_single(mad_agent->device->dma_device, + dma_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), sizeof(agent_send_wr->mad->mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ib_destroy_ah(agent_send_wr->ah); Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1272) +++ infiniband/core/user_mad.c (working copy) @@ -115,10 +115,10 @@ struct ib_umad_packet *packet = (void *) (unsigned long) send_wc->wr_id; - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ib_destroy_ah(packet->ah); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { @@ -267,10 +267,10 @@ goto err_up; } - gather_list.addr = pci_map_single(agent->device->dma_device, + gather_list.addr = dma_map_single(agent->device->dma_device, packet->mad.data, sizeof packet->mad.data, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof packet->mad.data; gather_list.lkey = file->mr[packet->mad.id]->lkey; pci_unmap_addr_set(packet, mapping, gather_list.addr); @@ -285,10 +285,10 @@ ret = ib_post_send_mad(agent, &wr, &bad_wr); if (ret) { - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); goto err_up; } @@ -549,7 +549,7 @@ umad_dev->port[i - s].class_dev = class_simple_device_add(umad_class, umad_dev->port[i - s].dev.dev, - &device->dma_device->dev, + device->dma_device, "umad%d", umad_dev->port[i - s].devnum); if (IS_ERR(umad_dev->port[i - s].class_dev)) goto err_class; Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 1272) +++ infiniband/core/mad.c (working copy) @@ -53,16 +53,16 @@ * and/or other materials provided with the distribution. 
*/ +#include +#include #include + #include "mad_priv.h" #include "smi.h" #include "agent.h" -#include -#include - MODULE_LICENSE("Dual BSD/GPL"); MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); @@ -1094,11 +1094,11 @@ mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, mad_list); recv = container_of(mad_priv_hdr, struct ib_mad_private, header); - pci_unmap_single(port_priv->device->dma_device, + dma_unmap_single(port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); /* Setup MAD receive work completion from "normal" work completion */ recv->header.recv_wc.wc = wc; @@ -1627,12 +1627,12 @@ break; } } - sg_list.addr = pci_map_single(qp_info->port_priv-> + sg_list.addr = dma_map_single(qp_info->port_priv-> device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; mad_priv->header.mad_list.mad_queue = recv_queue; @@ -1648,12 +1648,12 @@ list_del(&mad_priv->header.mad_list.list); recv_queue->count--; spin_unlock_irqrestore(&recv_queue->lock, flags); - pci_unmap_single(qp_info->port_priv->device->dma_device, + dma_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); break; @@ -1686,11 +1686,11 @@ list_del(&mad_list->list); /* Undo PCI mapping */ - pci_unmap_single(qp_info->port_priv->device->dma_device, + dma_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, recv); } Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1276) +++ infiniband/core/sa_query.c (working copy) @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -43,14 +44,14 @@ u16 attr_offset; u16 reserved; ib_sa_comp_mask comp_mask; -} __attribute__((packed)); +} __attribute__ ((packed)); struct ib_sa_mad { struct ib_mad_hdr mad_hdr; struct ib_rmpp_hdr rmpp_hdr; struct ib_sa_hdr sa_hdr; u8 data[200]; -} __attribute__((packed)); +} __attribute__ ((packed)); struct ib_sa_sm_ah { struct ib_ah *ah; @@ -460,20 +461,20 @@ wr.wr.ud.ah = port->sm_ah->ah; spin_unlock_irqrestore(&port->ah_lock, flags); - gather_list.addr = pci_map_single(port->agent->device->dma_device, + gather_list.addr = dma_map_single(port->agent->device->dma_device, query->mad, sizeof (struct ib_sa_mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof (struct ib_sa_mad); gather_list.lkey = port->mr->lkey; pci_unmap_addr_set(query, mapping, gather_list.addr); ret = ib_post_send_mad(port->agent, &wr, &bad_wr); if (ret) { - pci_unmap_single(port->agent->device->dma_device, + dma_unmap_single(port->agent->device->dma_device, pci_unmap_addr(query, mapping), sizeof (struct ib_sa_mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kref_put(&query->sm_ah->ref, free_sm_ah); spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, query->id); @@ -662,10 +663,10 @@ break; } - 
pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(query, mapping), sizeof (struct ib_sa_mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kref_put(&query->sm_ah->ref, free_sm_ah); query->release(query); Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 1272) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -27,6 +27,7 @@ #include #include #include +#include #include #include Index: infiniband/hw/mthca/mthca_main.c =================================================================== --- infiniband/hw/mthca/mthca_main.c (revision 1272) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -28,7 +28,6 @@ #include #include #include -#include #ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL #include Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1272) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -573,7 +573,7 @@ strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; - dev->ib_dev.dma_device = dev->pdev; + dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; Index: infiniband/hw/mthca/mthca_mad.c =================================================================== --- infiniband/hw/mthca/mthca_mad.c (revision 1272) +++ infiniband/hw/mthca/mthca_mad.c (working copy) @@ -144,10 +144,10 @@ wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; wr.wr_id = (unsigned long) tmad; - gather_list.addr = pci_map_single(agent->device->dma_device, + gather_list.addr = dma_map_single(agent->device->dma_device, tmad->mad, sizeof *tmad->mad, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof *tmad->mad; gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; pci_unmap_addr_set(tmad, mapping, gather_list.addr); @@ -167,10 +167,10 @@ spin_unlock_irqrestore(&dev->sm_lock, flags); if (ret) { - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(tmad, mapping), sizeof *tmad->mad, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kfree(tmad->mad); kfree(tmad); } @@ -259,10 +259,10 @@ struct mthca_trap_mad *tmad = (void *) (unsigned long) mad_send_wc->wr_id; - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(tmad, mapping), sizeof *tmad->mad, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kfree(tmad->mad); kfree(tmad); } From halr at voltaire.com Mon Nov 22 20:24:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Nov 2004 23:24:37 -0500 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101176063.17750.58.camel@localhost> References: <1101173164.18604.53.camel@localhost> <52vfbx9wlw.fsf@topspin.com> <1101176063.17750.58.camel@localhost> Message-ID: <1101183877.4124.545.camel@localhost.localdomain> On Mon, 2004-11-22 at 21:14, Josh England wrote: > On Mon, 2004-11-22 at 17:34 -0800, Roland Dreier wrote: > > What subnet manager are you using? > > Embedded Voltaire SM. I run this way all the time (with smaller configurations). Multicast works in general. 
-- Hal From halr at voltaire.com Mon Nov 22 20:26:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Nov 2004 23:26:18 -0500 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101173164.18604.53.camel@localhost> References: <1101173164.18604.53.camel@localhost> Message-ID: <1101183978.4124.548.camel@localhost.localdomain> Hi Josh, On Mon, 2004-11-22 at 20:26, Josh England wrote: > I've got an 85-node x86_64 PCIe cluster I'd like to run (and test) > openIB on. I've built a kernel using the latest patches from SVN, > loaded all the modules, and I see ACTIVE on the ports, but IPoIB does > not seem to want to work. What is the firmware version of the PCIe adapters ? I have seen problems like this when not all the adapters were at 4.5.3. You can get this via: cat /sys/class/infiniband/mthca0/fw_ver -- Hal From mlleinin at hpcn.ca.sandia.gov Mon Nov 22 21:15:38 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 22 Nov 2004 21:15:38 -0800 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101183978.4124.548.camel@localhost.localdomain> References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> Message-ID: <1101186939.29554.92.camel@trinity> On Mon, 2004-11-22 at 23:26 -0500, Hal Rosenstock wrote: > Hi Josh, > > On Mon, 2004-11-22 at 20:26, Josh England wrote: > > I've got an 85-node x86_64 PCIe cluster I'd like to run (and test) > > openIB on. I've built a kernel using the latest patches from SVN, > > loaded all the modules, and I see ACTIVE on the ports, but IPoIB does > > not seem to want to work. > > What is the firmware version of the PCIe adapters ? I have seen problems > like this when not all the adapters were at 4.5.3. > > You can get this via: > > cat /sys/class/infiniband/mthca0/fw_ver > We are using fw_ver 4.5.0. Looks like we need to upgrade. Time to try the user space firmware burning tools. - Matt From greg at kroah.com Mon Nov 22 22:30:45 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 22:30:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <52ekil9v1m.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> Message-ID: <20041123063045.GA22493@kroah.com> On Mon, Nov 22, 2004 at 06:08:21PM -0800, Roland Dreier wrote: > Greg> This could be in a sysfs file, right? > > Ugh, how does one add an attribute (like the ABI version) to a > class_simple? It shouldn't be per-device but I don't see anything > like class_create_file() that could work for class_simple. class_simple_device_add returns a pointer to a struct class_device * that you can then use to create a file in sysfs with. That should be what you're looking for. thanks, greg k-h From greg at kroah.com Mon Nov 22 22:41:20 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 22:41:20 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <527jodbgqo.fsf@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> Message-ID: <20041123064120.GB22493@kroah.com> On Mon, Nov 22, 2004 at 03:34:23PM -0800, Roland Dreier wrote: > Greg> No email address of who to bug with issues? > > There's a patch to MAINTAINERS... Yeah, but a name in each file is much nicer. 
> Greg> What is "RESERVED"? I must be missing a previous patch > Greg> somewhere, I currently don't see all of the series yet. > > It's in part 1/12: http://article.gmane.org/gmane.linux.kernel/257531 > unfortunately some people marked it as spam and it didn't get > everywhere. Thanks for pointing this out. One comment, the file drivers/infiniband/core/cache.c has a license that is illegal due to the contents of the file. Please change the license of the file to GPL only. Oh, and how about kernel-doc comments for all functions that are EXPORT_SYMBOL() marked? And for your core big structures? thanks, greg k-h From roland at topspin.com Mon Nov 22 22:45:02 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 22:45:02 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041123063045.GA22493@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 22:30:45 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> Message-ID: <52llct83o1.fsf@topspin.com> Greg> class_simple_device_add returns a pointer to a struct Greg> class_device * that you can then use to create a file in Greg> sysfs with. That should be what you're looking for. Shouldn't the ABI version be an attribute in /sys/class/infiniband_mad rather than being per-device? (I'm already creating several per-device attributes for the devices I get back from class_simple_device_add). - R. From greg at kroah.com Mon Nov 22 22:45:08 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 22:45:08 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122233047.GH27658@sventech.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> <20041122233047.GH27658@sventech.com> Message-ID: <20041123064508.GC22493@kroah.com> On Mon, Nov 22, 2004 at 03:30:47PM -0800, Johannes Erdfelt wrote: > On Mon, Nov 22, 2004, Greg KH wrote: > > On Mon, Nov 22, 2004 at 02:58:45PM -0800, Roland Dreier wrote: > > > Greg> Oh, have you asked for a real major number to be reserved > > > Greg> for umad? > > > > > > No, I think we're fine with a dynamic major. Is there any reason to > > > want a real major? > > > > People who do not use udev will not like you. > > I don't quite understand this. Given things like udev, wouldn't dynamic > majors work just like having a static major number? Yes, but people who do not use udev, will have a hard time creating the device nodes by hand every time. thanks, greg k-h From roland at topspin.com Mon Nov 22 22:47:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 22:47:29 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041123064120.GB22493@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 22:41:20 -0800") References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> <20041123064120.GB22493@kroah.com> Message-ID: <52hdnh83jy.fsf@topspin.com> Greg> Yeah, but a name in each file is much nicer. Very little of the kernel seems to follow this rule right now. 
Greg> One comment, the file drivers/infiniband/core/cache.c has a Greg> license that is illegal due to the contents of the file. Greg> Please change the license of the file to GPL only. ?? Can you explain this? What makes that file special? Greg> Oh, and how about kernel-doc comments for all functions that Greg> are EXPORT_SYMBOL() marked? And for your core big Greg> structures? I guess we'll start working on it... - Roland From johannes at erdfelt.com Mon Nov 22 22:51:10 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Mon, 22 Nov 2004 22:51:10 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041123064508.GC22493@kroah.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> <20041122233047.GH27658@sventech.com> <20041123064508.GC22493@kroah.com> Message-ID: <20041123065110.GA3959@sventech.com> On Mon, Nov 22, 2004, Greg KH wrote: > On Mon, Nov 22, 2004 at 03:30:47PM -0800, Johannes Erdfelt wrote: > > On Mon, Nov 22, 2004, Greg KH wrote: > > > People who do not use udev will not like you. > > > > I don't quite understand this. Given things like udev, wouldn't dynamic > > majors work just like having a static major number? > > Yes, but people who do not use udev, will have a hard time creating the > device nodes by hand every time. Ok, I can understand that for now. Is the eventual plan to move to dynamic majors for all devices? JE From greg at kroah.com Mon Nov 22 23:29:45 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:29:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <52hdnh83jy.fsf@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> <20041123064120.GB22493@kroah.com> <52hdnh83jy.fsf@topspin.com> Message-ID: <20041123072944.GA22786@kroah.com> On Mon, Nov 22, 2004 at 10:47:29PM -0800, Roland Dreier wrote: > Greg> Yeah, but a name in each file is much nicer. > > Very little of the kernel seems to follow this rule right now. I agree, but it's good to add this for new files. > Greg> One comment, the file drivers/infiniband/core/cache.c has a > Greg> license that is illegal due to the contents of the file. > Greg> Please change the license of the file to GPL only. > > ?? Can you explain this? What makes that file special? You are using a specific data structure that is only licensed to be used in GPL code. By using it in code that has a non-GPL license (like the dual license you have) you are violating the license of that code, and open yourself up to lawsuits by the holder of that code. There, can I be vague enough? :) To be straightforward, either drop the RCU code completely, or change the license of your code. Hm, because of the fact that you are linking in GPL only code into this code (because of the .h files you are using) how could you ever expect to use a BSD-like license for this collected work? Aren't licenses fun... 
thanks, greg k-h From greg at kroah.com Mon Nov 22 23:38:27 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:38:27 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041123065110.GA3959@sventech.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> <20041122233047.GH27658@sventech.com> <20041123064508.GC22493@kroah.com> <20041123065110.GA3959@sventech.com> Message-ID: <20041123073827.GA23122@kroah.com> On Mon, Nov 22, 2004 at 10:51:10PM -0800, Johannes Erdfelt wrote: > > Is the eventual plan to move to dynamic majors for all devices? No, some people will not allow that to happen, it would break too many old programs and configurations. It will probably be a config option if people wish to try it out (it's only about a 3 line change to the kernel to enable this, I need to just submit the patch one of these days...) thanks, greg k-h From greg at kroah.com Mon Nov 22 23:43:37 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:43:37 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <52llct83o1.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> <52llct83o1.fsf@topspin.com> Message-ID: <20041123074337.GB23194@kroah.com> On Mon, Nov 22, 2004 at 10:45:02PM -0800, Roland Dreier wrote: > Greg> class_simple_device_add returns a pointer to a struct > Greg> class_device * that you can then use to create a file in > Greg> sysfs with. That should be what you're looking for. > > Shouldn't the ABI version be an attribute in /sys/class/infiniband_mad > rather than being per-device? Yes, it probably should be. Hm, no, we don't allow you to put class specific files if you use the class_simple API, sorry I misread your question. You can just handle the class yourself and use the CLASS_ATTR() macro to define your api version function. thanks, greg k-h From greg at kroah.com Mon Nov 22 23:45:51 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:45:51 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <52oehpbi2j.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52oehpbi2j.fsf@topspin.com> Message-ID: <20041123074551.GC23194@kroah.com> On Mon, Nov 22, 2004 at 03:05:40PM -0800, Roland Dreier wrote: > Greg> You are letting any user, with any privilege register or > Greg> unregister an "agent"? > > They have to be able to open the device node. We could add a check > that they have it open for writing but there's not really much point > in opening this device read-only. Ok, I remember this conversation a while ago. We discussed this same thing a number of months back on the openib mailing list. Nevermind :) > Greg> Also, these "agents" seem to be a type of filter, right? Is > Greg> there no other way to implement this than an ioctl? > > ioctl seems to be the least bad way to me. This really feels like a > legitimate use of ioctl to me -- we use read/write to handle passing > data through our file descriptor, and ioctl for control of the > properties of the descriptor. 
> > What would you suggest as an ioctl replacement? I really can't think of anything else. It just will require a _lot_ of vigilant attention to prevent people from adding other ioctls to this one, right? Do you have other ioctls planned for this same interface for stage 2 and future stages of ib implementation for Linux? thanks, greg k-h From ebiederman at lnxi.com Tue Nov 23 00:49:16 2004 From: ebiederman at lnxi.com (Eric W. Biederman) Date: 23 Nov 2004 01:49:16 -0700 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122153144.GA4821@infradead.org> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122153144.GA4821@infradead.org> Message-ID: Christoph Hellwig writes: > > + When the IPoIB driver is loaded, it creates one interface for each > > + port using the P_Key at index 0. To create an interface with a > > + different P_Key, write the desired P_Key into the main interface's > > + /sys/class/net//create_child file. For example: > > + > > + echo 0x8001 > /sys/class/net/ib0/create_child > > + > > + This will create an interface named ib0.8001 with P_Key 0x8001. To > > + remove a subinterface, use the "delete_child" file: > > + > > + echo 0x8001 > /sys/class/net/ib0/delete_child > > + > > + The P_Key for any interface is given by the "pkey" file, and the > > + main interface for a subinterface is in "parent." > > Any reason this doesn't use an interface similar to the normal vlan code? > > And what is a P_Key? IB version of a vlan identifier. Eric From arnd at arndb.de Tue Nov 23 04:13:57 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 23 Nov 2004 13:13:57 +0100 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> Message-ID: <200411231313.57758.arnd@arndb.de> On Maandag 22 November 2004 16:13, Roland Dreier wrote: > I'm very happy to be able to post an initial version of InfiniBand > patches for review. Patches 1, 3 and 5 didn't make it to lkml. Did you hit the 100kb size limit for mails? Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature URL: From roland at topspin.com Tue Nov 23 07:04:44 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:04:44 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041123074551.GC23194@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 23:45:51 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52oehpbi2j.fsf@topspin.com> <20041123074551.GC23194@kroah.com> Message-ID: <527joc8v3n.fsf@topspin.com> Greg> Do you have other ioctls planned for this same interface for Greg> stage 2 and future stages of ib implementation for Linux? Not that I know of. 
- Roland From roland at topspin.com Tue Nov 23 07:06:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:06:07 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041123074337.GB23194@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 23:43:37 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> <52llct83o1.fsf@topspin.com> <20041123074337.GB23194@kroah.com> Message-ID: <523bz08v1c.fsf@topspin.com> Greg> Yes, it probably should be. Hm, no, we don't allow you to Greg> put class specific files if you use the class_simple API, Greg> sorry I misread your question. You can just handle the Greg> class yourself and use the CLASS_ATTR() macro to define your Greg> api version function. Ugh, then we end up duplicating the class_simple code. Would you accept a patch that adds class_simple_create_file()/class_simple_remove_file()? Thanks, Roland From greg at kroah.com Tue Nov 23 07:17:47 2004 From: greg at kroah.com (Greg KH) Date: Tue, 23 Nov 2004 07:17:47 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <523bz08v1c.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> <52llct83o1.fsf@topspin.com> <20041123074337.GB23194@kroah.com> <523bz08v1c.fsf@topspin.com> Message-ID: <20041123151747.GA26986@kroah.com> On Tue, Nov 23, 2004 at 07:06:07AM -0800, Roland Dreier wrote: > Greg> Yes, it probably should be. Hm, no, we don't allow you to > Greg> put class specific files if you use the class_simple API, > Greg> sorry I misread your question. You can just handle the > Greg> class yourself and use the CLASS_ATTR() macro to define your > Greg> api version function. > > Ugh, then we end up duplicating the class_simple code. Would you > accept a patch that adds class_simple_create_file()/class_simple_remove_file()? Ick, ok, sure. Just make sure to mark them as EXPORT_SYMBOL_GPL() :) thanks, greg k-h From roland at topspin.com Tue Nov 23 07:20:35 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:20:35 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <200411231313.57758.arnd@arndb.de> (Arnd Bergmann's message of "Tue, 23 Nov 2004 13:13:57 +0100") References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> <200411231313.57758.arnd@arndb.de> Message-ID: <52vfbw7fss.fsf@topspin.com> Arnd> Patches 1, 3 and 5 didn't make it to lkml. Did you hit the Arnd> 100kb size limit for mails? Ah, that must be what happened. I was confused because gmane.org did pick them up, but I think that's because gmane is also subscribed to openib-general (which is cc'ed). I'll reroll the patches, splitting the too-large pieces, and send soon. 
Thanks, Roland From roland at topspin.com Tue Nov 23 07:39:03 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:39:03 -0800 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101186939.29554.92.camel@trinity> (Matt Leininger's message of "Mon, 22 Nov 2004 21:15:38 -0800") References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> <1101186939.29554.92.camel@trinity> Message-ID: <52llcs7ey0.fsf@topspin.com> Matt> We are using fw_ver 4.5.0. Looks like we need to upgrade. Matt> Time to try the user space firmware burning tools. I would recommend _not_ using tvflash to upgrade PCIe HCAs from FW 4.5.0 to 4.5.3 right now. The invariant sector of flash needs to be rewritten, and the version of tvflash checked in right now doesn't handle that properly yet. Give me a day or so to fix it... - Roland From rddunlap at osdl.org Tue Nov 23 07:27:05 2004 From: rddunlap at osdl.org (Randy.Dunlap) Date: Tue, 23 Nov 2004 07:27:05 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041123064120.GB22493@kroah.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> <20041123064120.GB22493@kroah.com> Message-ID: <41A356C9.40607@osdl.org> Greg KH wrote: > On Mon, Nov 22, 2004 at 03:34:23PM -0800, Roland Dreier wrote: > >> Greg> No email address of who to bug with issues? >> >>There's a patch to MAINTAINERS... > > > Yeah, but a name in each file is much nicer. I disagree. I'd rather be able to look in MAINTAINERS | CREDITS for all such references. -- ~Randy From roland at topspin.com Tue Nov 23 08:14:08 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:08 -0800 Subject: [openib-general] [PATCH][RFC/v2][0/21] Second submission of InfiniBand patches for review Message-ID: <20041123814.p0AnYzTlx42JeVes@topspin.com> Here is the second version of the InfiniBand driver patch set. These patches incorporate most but not all of the feedback received since Monday. However, the main reason for posting this new set is that several of the patches in the first batch ran afoul of the 100K limit on linux-kernel. This batch is split into smaller pieces so all the parts should make it through this time. Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Tue Nov 23 08:14:14 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:14 -0800 Subject: [openib-general] [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041123814.p0AnYzTlx42JeVes@topspin.com> Message-ID: <20041123814.rXLIXw020elfd6Da@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-11-23 08:10:15.790234096 -0800 @@ -0,0 +1,49 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ib_cache.h 1255 2004-11-17 17:20:41Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-11-23 08:10:15.847225692 -0800 @@ -0,0 +1,69 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_fmr_pool.h 696 2004-08-28 03:10:21Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +int ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params, + struct ib_fmr_pool **pool_handle); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-11-23 08:10:15.909216552 -0800 @@ -0,0 +1,241 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_pack.h 1051 2004-10-25 02:47:17Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. + * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. +*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + 
IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-11-23 08:10:15.974206969 -0800 @@ -0,0 +1,984 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id: ib_verbs.h 1226 2004-11-13 04:35:49Z roland $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_static_rate { + IB_STATIC_RATE_FULL = 0, + IB_STATIC_RATE_12X_TO_4X = 2, + IB_STATIC_RATE_4X_TO_1X = 3, + IB_STATIC_RATE_12X_TO_1X = 11 +}; + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + 
IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; +}; + +enum ib_device_modify_flags { + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). 
+ */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. + */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; 
+ u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr *mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct 
ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct 
kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +struct ib_pd *ib_alloc_pd(struct ib_device *device); +int ib_dealloc_pd(struct ib_pd *pd); + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); +int ib_destroy_ah(struct ib_ah *ah); + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +int ib_destroy_qp(struct ib_qp *qp); + +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +int ib_resize_cq(struct ib_cq *cq, int cqe); +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. 
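+ *
+ * A minimal consumer loop built on these return semantics might look
+ * like the following (sketch only; process_completion() is a
+ * placeholder for consumer code):
+ *
+ *	struct ib_wc wc;
+ *
+ *	while (ib_poll_cq(cq, 1, &wc) > 0)
+ *		process_completion(&wc);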
+ */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - request completion notification + * @cq:the CQ to generate an event for + * @cq_notify:%IB_CQ_SOLICITED for next solicited event, + * %IB_CQ_NEXT_COMP for any completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); +int ib_dereg_mr(struct ib_mr *mr); + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +int ib_dealloc_mw(struct ib_mw *mw); + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +int ib_unmap_fmr(struct list_head *fmr_list); +int ib_dealloc_fmr(struct ib_fmr *fmr); + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Tue Nov 23 08:14:19 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:19 -0800 Subject: [openib-general] [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123814.rXLIXw020elfd6Da@topspin.com> Message-ID: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:16.399144313 -0800 @@ -0,0 +1,11 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + default n + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. 
+ +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-11-23 08:10:16.436138859 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:16.496130013 -0800 @@ -0,0 +1,13 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += \ + ib_core.o + +ib_core-objs := \ + packer.o \ + ud_header.o \ + verbs.o \ + sysfs.o \ + device.o \ + fmr_pool.o \ + cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-11-23 08:10:16.816082837 -0800 @@ -0,0 +1,338 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: cache.c 1257 2004-11-17 23:12:18Z roland $ +*/ + +#include +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + struct rcu_head rcu; + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + struct rcu_head rcu; + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; +} + +static void rcu_free_pkey(struct rcu_head *head) +{ + struct ib_pkey_cache *cache = + container_of(head, struct ib_pkey_cache, rcu); + kfree(cache); +} + +static void rcu_free_gid(struct rcu_head *head) +{ + struct ib_gid_cache *cache = + container_of(head, struct ib_gid_cache, rcu); + kfree(cache); +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + rcu_read_lock(); + + cache = rcu_dereference(device->cache.gid_cache[port - start_port(device)]); + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + rcu_read_unlock(); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + rcu_read_lock(); + + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + rcu_read_unlock(); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + rcu_read_lock(); + + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + INIT_RCU_HEAD(&pkey_cache->rcu); + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + INIT_RCU_HEAD(&gid_cache->rcu); + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + rcu_assign_pointer(device->cache.pkey_cache[port - start_port(device)], + pkey_cache); + 
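+	/*
+	 * rcu_assign_pointer() publishes each fully initialized table, so
+	 * readers in ib_cached_gid_get()/ib_cached_pkey_get() never see a
+	 * half-built entry; the old tables are freed via call_rcu() below,
+	 * only after all current RCU readers have finished.
+	 */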
rcu_assign_pointer(device->cache.gid_cache [port - start_port(device)],
+			   gid_cache);
+
+	if (old_pkey_cache)
+		call_rcu(&old_pkey_cache->rcu, rcu_free_pkey);
+	if (old_gid_cache)
+		call_rcu(&old_gid_cache->rcu, rcu_free_gid);
+
+	kfree(tprops);
+	return;
+
+err:
+	kfree(pkey_cache);
+	kfree(gid_cache);
+	kfree(tprops);
+}
+
+static void ib_cache_task(void *work_ptr)
+{
+	struct ib_update_work *work = work_ptr;
+
+	ib_cache_update(work->device, work->port_num);
+	kfree(work);
+}
+
+static void ib_cache_event(struct ib_event_handler *handler,
+			   struct ib_event *event)
+{
+	struct ib_update_work *work;
+
+	if (event->event == IB_EVENT_PORT_ERR    ||
+	    event->event == IB_EVENT_PORT_ACTIVE ||
+	    event->event == IB_EVENT_LID_CHANGE  ||
+	    event->event == IB_EVENT_PKEY_CHANGE ||
+	    event->event == IB_EVENT_SM_CHANGE) {
+		work = kmalloc(sizeof *work, GFP_ATOMIC);
+		if (work) {
+			INIT_WORK(&work->work, ib_cache_task, work);
+			work->device   = event->device;
+			work->port_num = event->element.port_num;
+			schedule_work(&work->work);
+		}
+	}
+}
+
+void ib_cache_setup_one(struct ib_device *device)
+{
+	int p;
+
+	device->cache.pkey_cache =
+		kmalloc(sizeof *device->cache.pkey_cache *
+			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
+	device->cache.gid_cache =
+		kmalloc(sizeof *device->cache.gid_cache *
+			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
+
+	if (!device->cache.pkey_cache || !device->cache.gid_cache) {
+		printk(KERN_WARNING "Couldn't allocate cache "
+		       "for %s\n", device->name);
+		goto err;
+	}
+
+	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
+		device->cache.pkey_cache[p] = NULL;
+		device->cache.gid_cache [p] = NULL;
+		ib_cache_update(device, p + start_port(device));
+	}
+
+	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
+			      device, ib_cache_event);
+	if (ib_register_event_handler(&device->cache.event_handler))
+		goto err_cache;
+
+	return;
+
+err_cache:
+	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
+		kfree(device->cache.pkey_cache[p]);
+		kfree(device->cache.gid_cache[p]);
+	}
+
+err:
+	kfree(device->cache.pkey_cache);
+	kfree(device->cache.gid_cache);
+}
+
+void ib_cache_cleanup_one(struct ib_device *device)
+{
+	int p;
+
+	ib_unregister_event_handler(&device->cache.event_handler);
+	flush_scheduled_work();
+
+	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
+		kfree(device->cache.pkey_cache[p]);
+		kfree(device->cache.gid_cache[p]);
+	}
+
+	kfree(device->cache.pkey_cache);
+	kfree(device->cache.gid_cache);
+}
+
+struct ib_client cache_client = {
+	.name   = "cache",
+	.add    = ib_cache_setup_one,
+	.remove = ib_cache_cleanup_one
+};
+
+int __init ib_cache_setup(void)
+{
+	return ib_register_client(&cache_client);
+}
+
+void __exit ib_cache_cleanup(void)
+{
+	ib_unregister_client(&cache_client);
+}
+
+/*
+  Local Variables:
+  c-file-style: "linux"
+  indent-tabs-mode: t
+  End:
+*/
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/core/core_priv.h	2004-11-23 08:10:16.845078561 -0800
@@ -0,0 +1,48 @@
+/*
+  This software is available to you under a choice of one of two
+  licenses.  You may choose to be licensed under the terms of the GNU
+  General Public License (GPL) Version 2, available at
+  , or the OpenIB.org BSD
+  license, available in the LICENSE.TXT file accompanying this
+  software.  These details are also available at
+  .
+ + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: core_priv.h 1179 2004-11-09 05:04:42Z roland $ +*/ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-11-23 08:10:16.735094778 -0800 @@ -0,0 +1,462 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: device.c 1179 2004-11-09 05:04:42Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. 
+ */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + 
INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + 
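+	/*
+	 * event_handler_lock also serializes ib_dispatch_event()'s walk of
+	 * this list, so a handler cannot be invoked while it is being
+	 * added or removed.
+	 */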
list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-11-23 08:10:16.773089176 -0800 @@ -0,0 +1,470 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: fmr_pool.c 1082 2004-10-27 20:32:50Z roland $ +*/ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + If an FMR is not in use, then the list member will point to either + its pool's free_list (if the FMR can be mapped again; that is, + remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + FMR needs to be unmapped before being remapped). In either of these + cases it is a bug if the ref_count is not 0. In other words, if + ref_count is > 0, then the list member must not be linked into + either free_list or dirty_list. + + The cache_node member is used to link the FMR into a cache bucket + (if caching is enabled). This is independent of the reference count + of the FMR. When a valid FMR is released, its ref_count is + decremented, and if ref_count reaches 0, the FMR is placed in either + free_list or dirty_list as appropriate. However, it is not removed + from the cache and may be "revived" if a call to + ib_fmr_register_physical() occurs before the FMR is remapped. In + this case we just increment the ref_count and remove the FMR from + free_list/dirty_list. + + Before we remap an FMR from free_list, we remove it from the cache + (to prevent another user from obtaining a stale FMR). When an FMR + is released, we add it to the tail of the free list, so that our + cache eviction policy is "least recently used." + + All manipulation of ref_count, list and cache_node is protected by + pool_lock to maintain consistency. 
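+
+  A sketch of the consumer-side lifecycle this implies (illustrative
+  only; page_list/npages/io_addr belong to the caller, and the calls
+  are the pool API defined below):
+
+	fmr = ib_fmr_pool_map_phys(pool, page_list, npages, &io_addr);
+		(ref_count 0 -> 1; FMR comes from the cache or free_list)
+	... post work requests using fmr->fmr->lkey / fmr->fmr->rkey ...
+	ib_fmr_pool_unmap(fmr);
+		(ref_count 1 -> 0; FMR returns to free_list, or to
+		 dirty_list once remap_count reaches IB_FMR_MAX_REMAPS)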
+*/
+
+struct ib_fmr_pool {
+	spinlock_t                pool_lock;
+
+	int                       pool_size;
+	int                       max_pages;
+	int                       dirty_watermark;
+	int                       dirty_len;
+	struct list_head          free_list;
+	struct list_head          dirty_list;
+	struct hlist_head        *cache_bucket;
+
+	void                     (*flush_function)(struct ib_fmr_pool *pool,
+						   void              *arg);
+	void                     *flush_arg;
+
+	struct task_struct       *thread;
+
+	atomic_t                  req_ser;
+	atomic_t                  flush_ser;
+
+	wait_queue_head_t         force_wait;
+};
+
+static inline u32 ib_fmr_hash(u64 first_page)
+{
+	return jhash_2words((u32) first_page,
+			    (u32) (first_page >> 32),
+			    0);
+}
+
+/* Caller must hold pool_lock */
+static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool,
+						      u64 *page_list,
+						      int  page_list_len,
+						      u64  io_virtual_address)
+{
+	struct hlist_head *bucket;
+	struct ib_pool_fmr *fmr;
+	struct hlist_node *pos;
+
+	if (!pool->cache_bucket)
+		return NULL;
+
+	bucket = pool->cache_bucket + ib_fmr_hash(*page_list);
+
+	hlist_for_each_entry(fmr, pos, bucket, cache_node)
+		if (io_virtual_address == fmr->io_virtual_address &&
+		    page_list_len == fmr->page_list_len &&
+		    !memcmp(page_list, fmr->page_list,
+			    page_list_len * sizeof *page_list))
+			return fmr;
+
+	return NULL;
+}
+
+static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
+{
+	int ret;
+	struct ib_pool_fmr *fmr;
+	LIST_HEAD(unmap_list);
+	LIST_HEAD(fmr_list);
+
+	spin_lock_irq(&pool->pool_lock);
+
+	list_for_each_entry(fmr, &pool->dirty_list, list) {
+		hlist_del_init(&fmr->cache_node);
+		fmr->remap_count = 0;
+		list_add_tail(&fmr->fmr->list, &fmr_list);
+
+#ifdef DEBUG
+		if (fmr->ref_count != 0) {
+			printk(KERN_WARNING "Unmapping FMR %p with ref count %d\n",
+			       fmr, fmr->ref_count);
+		}
+#endif
+	}
+
+	list_splice(&pool->dirty_list, &unmap_list);
+	INIT_LIST_HEAD(&pool->dirty_list);
+	pool->dirty_len = 0;
+
+	spin_unlock_irq(&pool->pool_lock);
+
+	if (list_empty(&unmap_list)) {
+		return;
+	}
+
+	ret = ib_unmap_fmr(&fmr_list);
+	if (ret)
+		printk(KERN_WARNING "ib_unmap_fmr returned %d\n", ret);
+
+	spin_lock_irq(&pool->pool_lock);
+	list_splice(&unmap_list, &pool->free_list);
+	spin_unlock_irq(&pool->pool_lock);
+}
+
+static int ib_fmr_cleanup_thread(void *pool_ptr)
+{
+	struct ib_fmr_pool *pool = pool_ptr;
+
+	do {
+		if (pool->dirty_len >= pool->dirty_watermark ||
+		    atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) {
+			ib_fmr_batch_release(pool);
+
+			atomic_inc(&pool->flush_ser);
+			wake_up_interruptible(&pool->force_wait);
+
+			if (pool->flush_function)
+				pool->flush_function(pool, pool->flush_arg);
+		}
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (pool->dirty_len < pool->dirty_watermark &&
+		    atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 &&
+		    !kthread_should_stop())
+			schedule();
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+int ib_create_fmr_pool(struct ib_pd             *pd,
+		       struct ib_fmr_pool_param *params,
+		       struct ib_fmr_pool      **pool_handle)
+{
+	struct ib_device   *device;
+	struct ib_fmr_pool *pool;
+	int i;
+	int ret;
+
+	if (!params) {
+		return -EINVAL;
+	}
+
+	device = pd->device;
+	if (!device->alloc_fmr    ||
+	    !device->dealloc_fmr  ||
+	    !device->map_phys_fmr ||
+	    !device->unmap_fmr) {
+		printk(KERN_WARNING "Device %s does not support fast memory regions\n",
+		       device->name);
+		return -ENOSYS;
+	}
+
+	pool = kmalloc(sizeof *pool, GFP_KERNEL);
+	if (!pool) {
+		printk(KERN_WARNING "couldn't allocate pool struct\n");
+		return -ENOMEM;
+	}
+
+	pool->cache_bucket   = NULL;
+
+	pool->flush_function = params->flush_function;
+	pool->flush_arg      = params->flush_arg;
+
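+	/*
+	 * cache_bucket stays NULL unless params->cache is set below; a
+	 * NULL bucket table makes ib_fmr_cache_lookup() return early,
+	 * which is how caching is disabled for a pool.
+	 */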
INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + *pool_handle = pool; + return 0; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ret; + + out_fail: + ib_destroy_fmr_pool(pool); + *pool_handle = NULL; + + return -ENOMEM; +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* It's OK if someone else bumps req_ser again here -- we'll + just wait a little longer. 
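+	   The cleanup thread bumps flush_ser once per completed batch
+	   release, so the wait below returns only once a flush covering
+	   our request has finished.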
*/ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-11-23 08:10:16.560120578 -0800 @@ -0,0 +1,177 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: packer.c 1027 2004-10-20 03:59:00Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + 
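+			/*
+			 * shift is the distance from the field's least
+			 * significant bit to bit 0 of the big-endian word;
+			 * mask isolates the size_bits-wide field before it
+			 * is shifted down and stored in the structure.
+			 */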
u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-11-23 08:10:16.690101412 -0800 @@ -0,0 +1,684 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ *
+ * $Id: sysfs.c 1257 2004-11-17 23:12:18Z roland $
+ */
+
+#include "core_priv.h"
+
+#include 
+
+struct ib_port {
+	struct kobject         kobj;
+	struct ib_device      *ibdev;
+	struct attribute_group gid_group;
+	struct attribute     **gid_attr;
+	struct attribute_group pkey_group;
+	struct attribute     **pkey_attr;
+	u8                     port_num;
+};
+
+struct port_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf);
+	ssize_t (*store)(struct ib_port *, struct port_attribute *,
+			 const char *buf, size_t count);
+};
+
+#define PORT_ATTR(_name, _mode, _show, _store) \
+struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+#define PORT_ATTR_RO(_name) \
+struct port_attribute port_attr_##_name = __ATTR_RO(_name)
+
+struct port_table_attribute {
+	struct port_attribute attr;
+	int                   index;
+};
+
+static ssize_t port_attr_show(struct kobject *kobj,
+			      struct attribute *attr, char *buf)
+{
+	struct port_attribute *port_attr =
+		container_of(attr, struct port_attribute, attr);
+	struct ib_port *p = container_of(kobj, struct ib_port, kobj);
+
+	if (!port_attr->show)
+		return 0;
+
+	return port_attr->show(p, port_attr, buf);
+}
+
+static struct sysfs_ops port_sysfs_ops = {
+	.show = port_attr_show
+};
+
+static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
+			  char *buf)
+{
+	struct ib_port_attr attr;
+	ssize_t ret;
+
+	static const char *state_name[] = {
+		[IB_PORT_NOP]          = "NOP",
+		[IB_PORT_DOWN]         = "DOWN",
+		[IB_PORT_INIT]         = "INIT",
+		[IB_PORT_ARMED]        = "ARMED",
+		[IB_PORT_ACTIVE]       = "ACTIVE",
+		[IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER"
+	};
+
+	ret = ib_query_port(p->ibdev, p->port_num, &attr);
+	if (ret)
+		return ret;
+
+	return sprintf(buf, "%d: %s\n", attr.state,
+		       attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ?
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + 
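+	/*
+	 * The rest of this function builds a PerfMgmt PortCounters GET
+	 * MAD, hands it to the driver's process_mad hook, and decodes the
+	 * requested counter from the reply.  The attribute's index field
+	 * packs the whole query: bits 0-15 hold the counter's bit offset
+	 * within the PortCounters attribute, bits 16-23 its width in
+	 * bits, and bits 24 and up the counter number (see PORT_PMA_ATTR
+	 * above).
+	 */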
+	in_mad  = kmalloc(sizeof *in_mad,  GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+	if (!in_mad || !out_mad) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memset(in_mad, 0, sizeof *in_mad);
+	in_mad->mad_hdr.base_version  = 1;
+	in_mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_PERF_MGMT;
+	in_mad->mad_hdr.class_version = 1;
+	in_mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
+	in_mad->mad_hdr.attr_id       = cpu_to_be16(0x12); /* PortCounters */
+
+	in_mad->data[41] = p->port_num;	/* PortSelect field */
+
+	if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff,
+				   in_mad, out_mad) &
+	     (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) !=
+	    (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	switch (width) {
+	case 4:
+		/* offset is a bit offset, so the nibble position within
+		 * the byte must be computed modulo 8, not modulo 4 */
+		ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >>
+					    (4 - (offset % 8))) & 0xf);
+		break;
+	case 8:
+		ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]);
+		break;
+	case 16:
+		ret = sprintf(buf, "%u\n",
+			      be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8)));
+		break;
+	case 32:
+		ret = sprintf(buf, "%u\n",
+			      be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8)));
+		break;
+	default:
+		ret = 0;
+	}
+
+out:
+	kfree(in_mad);
+	kfree(out_mad);
+
+	return ret;
+}
+
+static PORT_PMA_ATTR(symbol_error                    ,  0, 16,  32);
+static PORT_PMA_ATTR(link_error_recovery             ,  1,  8,  48);
+static PORT_PMA_ATTR(link_downed                     ,  2,  8,  56);
+static PORT_PMA_ATTR(port_rcv_errors                 ,  3, 16,  64);
+static PORT_PMA_ATTR(port_rcv_remote_physical_errors ,  4, 16,  80);
+static PORT_PMA_ATTR(port_rcv_switch_relay_errors    ,  5, 16,  96);
+static PORT_PMA_ATTR(port_xmit_discards              ,  6, 16, 112);
+static PORT_PMA_ATTR(port_xmit_constraint_errors     ,  7,  8, 128);
+static PORT_PMA_ATTR(port_rcv_constraint_errors      ,  8,  8, 136);
+static PORT_PMA_ATTR(local_link_integrity_errors     ,  9,  4, 152);
+static PORT_PMA_ATTR(excessive_buffer_overrun_errors , 10,  4, 156);
+static PORT_PMA_ATTR(VL15_dropped                    , 11, 16, 176);
+static PORT_PMA_ATTR(port_xmit_data                  , 12, 32, 192);
+static PORT_PMA_ATTR(port_rcv_data                   , 13, 32, 224);
+static PORT_PMA_ATTR(port_xmit_packets               , 14, 32, 256);
+static PORT_PMA_ATTR(port_rcv_packets                , 15, 32, 288);
+
+static struct attribute *pma_attrs[] = {
+	&port_pma_attr_symbol_error.attr.attr,
+	&port_pma_attr_link_error_recovery.attr.attr,
+	&port_pma_attr_link_downed.attr.attr,
+	&port_pma_attr_port_rcv_errors.attr.attr,
+	&port_pma_attr_port_rcv_remote_physical_errors.attr.attr,
+	&port_pma_attr_port_rcv_switch_relay_errors.attr.attr,
+	&port_pma_attr_port_xmit_discards.attr.attr,
+	&port_pma_attr_port_xmit_constraint_errors.attr.attr,
+	&port_pma_attr_port_rcv_constraint_errors.attr.attr,
+	&port_pma_attr_local_link_integrity_errors.attr.attr,
+	&port_pma_attr_excessive_buffer_overrun_errors.attr.attr,
+	&port_pma_attr_VL15_dropped.attr.attr,
+	&port_pma_attr_port_xmit_data.attr.attr,
+	&port_pma_attr_port_rcv_data.attr.attr,
+	&port_pma_attr_port_xmit_packets.attr.attr,
+	&port_pma_attr_port_rcv_packets.attr.attr,
+	NULL
+};
+
+static struct attribute_group pma_group = {
+	.name  = "counters",
+	.attrs = pma_attrs
+};
+
+static void ib_port_release(struct kobject *kobj)
+{
+	struct ib_port *p = container_of(kobj, struct ib_port, kobj);
+	struct attribute *a;
+	int i;
+
+	for (i = 0; (a = p->gid_attr[i]); ++i) {
+		kfree(a->name);
+		kfree(a);
+	}
+
+	for (i = 0; (a = p->pkey_attr[i]); ++i) {
+		kfree(a->name);
+		kfree(a);
+	}
+
+	kfree(p->gid_attr);
+	kfree(p->pkey_attr);	/* the pkey table itself must be freed too */
+	kfree(p);
+}
+
+static struct kobj_type port_type = {
+	.release    = ib_port_release,
+	.sysfs_ops  = &port_sysfs_ops,
+
.default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. + */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + 
kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + 
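+	/* Tear each port down in the same order as the error path of
+	 * ib_device_register_sysfs() above: attribute groups first, then
+	 * the kobject that backs them. */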
+ list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-11-23 08:10:16.600114681 -0800 @@ -0,0 +1,333 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ud_header.c 1027 2004-10-20 03:59:00Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { 
STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? 
IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 
2004-11-23 08:10:16.644108194 -0800 @@ -0,0 +1,420 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? 
+ ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? + qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + 
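+/*
+ * Optional verbs follow a common pattern in this file: if the
+ * low-level driver does not supply a method, the wrapper fails
+ * cleanly with -ENOSYS (or ERR_PTR(-ENOSYS)) instead of calling
+ * through a NULL function pointer, as ib_rereg_phys_mr() does below.
+ */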
+int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? + mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Tue Nov 23 08:14:26 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:26 -0800 Subject: [openib-general] [PATCH][RFC/v2][3/21] Hook up drivers/infiniband In-Reply-To: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Message-ID: <20041123814.LeHMD5hRZLn6VbLm@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. 
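The drivers/infiniband/Kconfig file being sourced is introduced elsewhere in
this series; roughly, it needs to declare the CONFIG_INFINIBAND symbol that
the Makefile hunk below tests.  A minimal sketch (the help text here is
illustrative, not taken from the series):

	menu "InfiniBand support"

	config INFINIBAND
		tristate "InfiniBand support"
		---help---
		  Core support for InfiniBand (IB).  Select this to build
		  the ib_core module and enable the hardware drivers and
		  upper-layer protocols under drivers/infiniband.

	endmenu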
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-11-23 08:09:54.858320443 -0800 +++ linux-bk/drivers/Kconfig 2004-11-23 08:10:17.410995118 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-11-23 08:10:06.504603238 -0800 +++ linux-bk/drivers/Makefile 2004-11-23 08:10:17.411994971 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Tue Nov 23 08:14:31 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:31 -0800 Subject: [openib-general] [PATCH][RFC/v2][4/21] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <20041123814.LeHMD5hRZLn6VbLm@topspin.com> Message-ID: <20041123814.xOcI2C4YpT1G9jQi@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-11-23 08:10:17.682955018 -0800 @@ -0,0 +1,334 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 cpu_to_be32(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent - MAD agent that sent the MAD. + * @mad_send_wc - Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent - MAD agent requesting the received MAD. + * @mad_recv_wc - Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to the user through this routine are owned by the receiving client. + */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device - Reference to device registration is on. + * @qp - Reference to QP used for sending and receiving MADs. + * @recv_handler - Callback handler for a received MAD. + * @send_handler - Callback handler for a sent MAD. + * @context - User-specified context associated with this registration. + * @hi_tid - Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num - Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id - Work request identifier associated with the send MAD request. 
+ * @status - Completion status.
+ * @vendor_err - Optional vendor error information returned with a failed
+ * request.
+ */
+struct ib_mad_send_wc {
+	u64			wr_id;
+	enum ib_wc_status	status;
+	u32			vendor_err;
+};
+
+/**
+ * ib_mad_recv_buf - received MAD buffer information.
+ * @list - Reference to next data buffer for a received RMPP MAD.
+ * @grh - References a data buffer containing the global route header.
+ * The data referenced by this buffer is only valid if the GRH is
+ * valid.
+ * @mad - References the start of the received MAD.
+ */
+struct ib_mad_recv_buf {
+	struct list_head	list;
+	struct ib_grh		*grh;
+	struct ib_mad		*mad;
+};
+
+/**
+ * ib_mad_recv_wc - received MAD information.
+ * @wc - Completion information for the received data.
+ * @recv_buf - Specifies the location of the received data buffer(s).
+ * @mad_len - The length of the received MAD, without duplicated headers.
+ *
+ * For a received response, the wr_id field of the wc is set to the wr_id
+ * for the corresponding send request.
+ */
+struct ib_mad_recv_wc {
+	struct ib_wc		*wc;
+	struct ib_mad_recv_buf	*recv_buf;
+	int			mad_len;
+};
+
+/**
+ * ib_mad_reg_req - MAD registration request
+ * @mgmt_class - Indicates which management class of MADs should be received
+ * by the caller.  This field is only required if the user wishes to
+ * receive unsolicited MADs, otherwise it should be 0.
+ * @mgmt_class_version - Indicates which version of MADs for the given
+ * management class to receive.
+ * @method_mask - The caller will receive unsolicited MADs for any method
+ * whose corresponding bit in @method_mask is set.
+ */
+struct ib_mad_reg_req {
+	u8	mgmt_class;
+	u8	mgmt_class_version;
+	DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS);
+};
+
+/**
+ * ib_register_mad_agent - Register to send/receive MADs.
+ * @device - The device to register with.
+ * @port_num - The port on the specified device to use.
+ * @qp_type - Specifies which QP to access.  Must be either
+ * IB_QPT_SMI or IB_QPT_GSI.
+ * @mad_reg_req - Specifies which unsolicited MADs should be received
+ * by the caller.  This parameter may be NULL if the caller only
+ * wishes to receive solicited responses.
+ * @rmpp_version - If set, indicates that the client will send
+ * and receive MADs that contain the RMPP header for the given version.
+ * If set to 0, indicates that RMPP is not used by this client.
+ * @send_handler - The completion callback routine invoked after a send
+ * request has completed.
+ * @recv_handler - The completion callback routine invoked for a received
+ * MAD.
+ * @context - User-specified context associated with the registration.
+ */
+struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
+					   u8 port_num,
+					   enum ib_qp_type qp_type,
+					   struct ib_mad_reg_req *mad_reg_req,
+					   u8 rmpp_version,
+					   ib_mad_send_handler send_handler,
+					   ib_mad_recv_handler recv_handler,
+					   void *context);
+
+/**
+ * ib_unregister_mad_agent - Unregisters a client from using MAD services.
+ * @mad_agent - Corresponding MAD registration request to deregister.
+ *
+ * After invoking this routine, MAD services are no longer usable by the
+ * client on the associated QP.
+ */
+int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent);
+
+/**
+ * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated
+ * with the registered client.
+ * @mad_agent - Specifies the associated registration to post the send to.
+ * @send_wr - Specifies the information needed to send the MAD(s).
+ * @bad_send_wr - Specifies the MAD on which an error was encountered.
+ * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc - Work completion information for a received MAD. + * @buf - User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc - Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. + * @mad_agent - Specifies the registration associated with sent MAD. + * @wr_id - Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp - Reference to a QP that requires MAD services. + * @rmpp_version - If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler - The completion callback routine invoked after a send + * request has completed. + * @recv_handler - The completion callback routine invoked for a received + * MAD. + * @context - User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent - Specifies the registered MAD service using the redirected QP. + * @wc - References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. 
+ */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-11-23 08:10:17.722949121 -0800 @@ -0,0 +1,67 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION cpu_to_be16(0x8000) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Tue Nov 23 08:14:40 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:40 -0800 Subject: [openib-general] [PATCH][RFC/v2][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <20041123814.xOcI2C4YpT1G9jQi@topspin.com> Message-ID: <20041123814.sBoIUxeLIDc9lo4V@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. 
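As a rough sketch of how a consumer would use the registration API declared
in ib_mad.h above -- the my_* names, the PerfMgmt/GSI choice, and the
assumption that the caller supplies a valid device and port are illustrative,
not code from this series:

	#include <ib_mad.h>

	static void my_send_handler(struct ib_mad_agent *agent,
				    struct ib_mad_send_wc *mad_send_wc)
	{
		/* mad_send_wc->wr_id identifies the completed send */
	}

	static void my_recv_handler(struct ib_mad_agent *agent,
				    struct ib_mad_recv_wc *mad_recv_wc)
	{
		/* ... examine mad_recv_wc->recv_buf->mad here ... */

		/* receive buffers are owned by the client and must be
		 * returned to the access layer */
		ib_free_recv_mad(mad_recv_wc);
	}

	static struct ib_mad_agent *my_open_perf_agent(struct ib_device *device,
						       u8 port_num)
	{
		struct ib_mad_reg_req req = {
			.mgmt_class         = IB_MGMT_CLASS_PERF_MGMT,
			.mgmt_class_version = 1,
		};

		/* deliver unsolicited GET MADs of this class to
		 * my_recv_handler */
		set_bit(IB_MGMT_METHOD_GET, req.method_mask);

		/* PerfMgmt MADs travel on QP1, hence IB_QPT_GSI;
		 * rmpp_version 0 means no RMPP.  Returns an agent or an
		 * ERR_PTR value. */
		return ib_register_mad_agent(device, port_num, IB_QPT_GSI,
					     &req, 0, my_send_handler,
					     my_recv_handler, NULL);
	}

Sends would then go through ib_post_send_mad() and may complete out of
posting order via the send handler; the agent itself is eventually torn down
with ib_unregister_mad_agent().
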
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-23 08:10:16.496130013 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:17.978911380 -0800 @@ -1,7 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include obj-$(CONFIG_INFINIBAND) += \ - ib_core.o + ib_core.o \ + ib_mad.o ib_core-objs := \ packer.o \ @@ -11,3 +12,8 @@ device.o \ fmr_pool.o \ cache.o + +ib_mad-objs := \ + mad.o \ + smi.o \ + agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-11-23 08:10:18.065898554 -0800 @@ -0,0 +1,390 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return 
smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad->mad, + sizeof(mad->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? */ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(struct ib_grh)); + } else { + ah_attr.ah_flags = 0; /* No GRH for SM class */ + } + } else { + /* Directed route or LID routed SM class */ + ah_attr.ah_flags = 0; /* No GRH */ + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { + send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ + send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ + } + send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class 
in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + 
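+	/* All three MAD agents and the DMA MR are in place; publish the
+	 * port on the global list so agent_send() can find it. */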
spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-11-23 08:10:18.154885433 -0800 @@ -0,0 +1,42 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-11-23 08:10:18.178881895 -0800 @@ -0,0 +1,51 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . 
+ + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-11-23 08:10:18.021905041 -0800 @@ -0,0 +1,2109 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * Maintained by: vtrmaint1 at voltaire.com + * + * This program is intended for the purpose of Infiniband + * protocol stack for Linux Servers. + * + * This software program is free software and you are free to modifyi + * and/or redistribute it under a choice of one of the following two + * licenses: + * + * 1) under either the GNU General Public License (GPL) Version 2, June 1991, + * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory. + * This GPL license is also available from the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the + * web at http://www.fsf.org/copyleft/gpl.html + * + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, on the web at + * http://www.opensource.org/licenses/bsd-license.php. + * + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * + * + * To obtain a copy of these licenses, the source code to this software or + * for other questions, you may write to Voltaire, Inc., + * Attention: Voltaire openSource maintainer, + * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or + * by Email: vtrmaint1 at voltaire.com + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *priv); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static int solicited_mad(struct ib_mad *mad); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? 
+ 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + + if (rmpp_version) { + ret = ERR_PTR(-EINVAL); /* XXX: until RMPP implemented */ + goto error1; + } + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + if (!recv_handler) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + ret = ERR_PTR(-EINVAL); + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if (!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + class = port_priv->version[mad_reg_req->mgmt_class_version]; + if (class) { + mgmt_class = convert_mgmt_class( + mad_reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, mad_reg_req)) { + ret = ERR_PTR(-EINVAL); + goto error3; + } + } + } + } + + ret2 = add_mad_reg_req(mad_reg_req, 
mad_agent_priv); + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->work, timeout_sends, mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret; + + if (!smi_handle_dr_smp_send(smp, + mad_agent->device->node_type, + mad_agent->port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto error1; + } + if (smi_check_local_dr_smp(smp, + mad_agent->device, + mad_agent->port_num)) { + struct ib_mad_private *mad_priv; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wc mad_send_wc; + + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local " + "response MAD\n"); + goto error1; + } + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + if (mad_agent->device->process_mad) { + ret = mad_agent->device->process_mad( + mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? 
*/ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) { + ret = 1; + goto error1; + } + if (ret & IB_MAD_RESULT_REPLY) { + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to + * complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = 0; /* IB_QPT_SMI ? */ + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = + &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler( + mad_agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } + + if (mad_agent_priv->agent.send_handler) { + /* Now, complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent_priv->agent.send_handler( + mad_agent, + &mad_send_wc); + ret = 1; + } else + ret = -EINVAL; + } else + ret = 0; + +error1: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + 
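+		/*
+		 * The caller hands in a chain of UD work requests built
+		 * much like the one in agent_mad_send(); a minimal
+		 * sketch (field values are illustrative only):
+		 *
+		 *	wr.opcode            = IB_WR_SEND;
+		 *	wr.sg_list           = &sge;
+		 *	wr.num_sge           = 1;
+		 *	wr.send_flags        = IB_SEND_SIGNALED;
+		 *	wr.wr.ud.ah          = ah;
+		 *	wr.wr.ud.mad_hdr     = &mad->mad_hdr;
+		 *	wr.wr.ud.remote_qpn  = qpn;
+		 *	wr.wr.ud.remote_qkey = IB_QP1_QKEY;
+		 *	wr.wr.ud.timeout_ms  = 500;	/* expect a response */
+		 *	ret = ib_post_send_mad(mad_agent, &wr, &bad_wr);
+		 */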
/* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. + ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(entry, struct ib_mad_private_header, + recv_buf); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} 
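+/*
+ * Once RMPP is implemented, a reassembled receive may span multiple
+ * buffers chained through ib_mad_recv_buf.list (the list that
+ * ib_free_recv_mad() above already walks), and this routine will need
+ * to copy each segment into the caller's buffer -- roughly (sketch):
+ *
+ *	list_for_each_entry(seg, &mad_recv_wc->recv_buf->list, list) {
+ *		memcpy(buf + offset, seg->mad, segment_length);
+ *		offset += segment_length;
+ *	}
+ */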
+EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *priv) +{ + struct ib_mad_port_private *private; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + + int i, ret; + u8 mgmt_class; + + /* Make sure MAD registration request supplied */ + if (!mad_reg_req) + return 0; + + private = priv->qp_info->port_priv; + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + class = &private->version[mad_reg_req->mgmt_class_version]; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table for this class version */ + memset((*class)->method_table, 0, + sizeof((*class)->method_table)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = 
find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? + */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; + if (!class) { + printk(KERN_ERR PFX "No class table yet MAD registration " + "request supplied\n"); + goto out; + } + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[agent_priv->reg_req-> + mgmt_class_version]= NULL; + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *version; + struct ib_mad_mgmt_method_table *class; + + /* Routing is based on version, class, and method */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + version = port_priv->version[mad->mad_hdr.class_version]; + if (!version) + goto out; + class = version->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (class) + mad_agent = class->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + struct ib_smp *smp; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf = &recv->header.recv_buf; + recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; + recv->header.recv_buf.grh = &recv->grh; + + /* Validate MAD */ + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + smp = (struct ib_smp *)recv->header.recv_buf.mad; + if (!smi_handle_dr_smp_recv(smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(smp)) + goto local; + if (!smi_handle_dr_smp_send(smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + recv->header.recv_buf.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(recv->header.recv_buf.mad); + mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, + solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
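+		/*
+		 * refcount was set to 1 + (timeout > 0) when the send was
+		 * posted: one reference for the send completion and, when
+		 * a response is expected, one for the matching receive.
+		 * A failed send can never get a response, so its response
+		 * reference is dropped here in addition to the send
+		 * reference dropped below.
+		 */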
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup + */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = 
container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = 
find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + 
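+		/*
+		 * wr_id carries the address of the embedded mad_list
+		 * entry, so on completion the buffer is recovered the
+		 * same way the receive path does it:
+		 *
+		 *	mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id;
+		 *	mad_priv_hdr = container_of(mad_list,
+		 *				    struct ib_mad_private_header,
+		 *				    mad_list);
+		 *	recv = container_of(mad_priv_hdr,
+		 *			    struct ib_mad_private, header);
+		 *
+		 * The loop below keeps posting until the receive queue
+		 * reaches max_active or an allocation/post failure occurs.
+		 */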
+ /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get 
ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
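Flushing before destroying guarantees that any completion work already queued finishes before the QPs, MR, PD, and CQ below are released.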
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-11-23 08:10:18.221875555 -0800 @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. 
+ * Maintained by: vtrmaint1 at voltaire.com
+ *
+ * This program is intended to serve as part of the InfiniBand
+ * protocol stack for Linux servers.
+ *
+ * This software program is free software and you are free to modify
+ * and/or redistribute it under a choice of one of the following two
+ * licenses:
+ *
+ * 1) under the GNU General Public License (GPL) Version 2, June 1991,
+ * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory.
+ * This GPL license is also available from the Free Software Foundation,
+ * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the
+ * web at http://www.fsf.org/copyleft/gpl.html
+ *
+ * OR
+ *
+ * 2) under the terms of "The BSD License", a copy of which is in the file
+ * LICENSE2.txt in the root directory. The license is also available from
+ * the Open Source Initiative, on the web at
+ * http://www.opensource.org/licenses/bsd-license.php.
+ *
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ *
+ *
+ * To obtain a copy of these licenses, the source code to this software or
+ * for other questions, you may write to Voltaire, Inc.,
+ * Attention: Voltaire openSource maintainer,
+ * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or
+ * by Email: vtrmaint1 at voltaire.com
+ *
+ * Licensee has the right to choose either one of the above two licenses.
+ *
+ * Redistributions of source code must retain both the above copyright
+ * notice and either one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice and either one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 2048 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + struct ib_mad_recv_buf recv_buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct work; + unsigned long timeout; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + spinlock_t reg_lock; + struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 2004-11-23 08:10:18.110891920 -0800 @@ -0,0 +1,222 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + 
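/* (anything past hop_cnt + 1 is an unreasonable hop pointer, so the test below yields 0 and the SMP is dropped) */ +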
return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-11-23 08:10:18.259869953 -0800 @@ -0,0 +1,54 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Tue Nov 23 08:14:47 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:47 -0800 Subject: [openib-general] [PATCH][RFC/v2][6/21] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041123814.sBoIUxeLIDc9lo4V@topspin.com> Message-ID: <20041123814.UmUHBktptJzFvsrR@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-23 08:10:17.978911380 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:18.652812015 -0800 @@ -2,7 +2,8 @@ obj-$(CONFIG_INFINIBAND) += \ ib_core.o \ - ib_mad.o + ib_mad.o \ + ib_sa.o ib_core-objs := \ packer.o \ @@ -17,3 +18,5 @@ mad.o \ smi.o \ agent.o + +ib_sa-objs := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-11-23 08:10:18.678808182 -0800 @@ -0,0 +1,816 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { 
PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + 
.offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, 
+ sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_path_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_path_rec_callback; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf->mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf->mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
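+ * (If an event does arrive in that window, update_sm_ah() may simply run twice -- once from the handler's work item and once from the loop below -- which is harmless.)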
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + ib_set_client_data(device, &sa_client, sa_dev); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-11-23 08:10:18.729800663 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). 
+ * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, 
+ void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +static inline int +ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */

From roland at topspin.com Tue Nov 23 08:14:52 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:52 -0800 Subject: [openib-general] [PATCH][RFC/v2][7/21] Add Mellanox HCA low-level driver In-Reply-To: <20041123814.UmUHBktptJzFvsrR@topspin.com> Message-ID: <20041123814.y2QOtktHRf35o3M9@topspin.com>

Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.)

Signed-off-by: Roland Dreier

--- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-23 08:10:16.399144313 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:19.036755403 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-23 08:10:16.436138859 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-23 08:10:18.998761005 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-11-23 08:10:19.090747442 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions.
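(A note on the SSE doorbell option above: InfiniHost doorbell registers are 64 bits wide and must be hit with a single atomic store. On a 64-bit kernel one writeq suffices; on 32-bit x86 the driver otherwise has to take a spinlock and issue two 32-bit writes so concurrent doorbell rings cannot interleave. A minimal sketch of those two non-SSE paths -- illustrative only, the helper name doorbell_write is not the driver's:

#include <linux/types.h>
#include <linux/spinlock.h>
#include <asm/io.h>

static inline void doorbell_write(u32 hi, u32 lo, void __iomem *dest,
                                  spinlock_t *doorbell_lock)
{
#if BITS_PER_LONG == 64
        /* One atomic 64-bit store; no locking needed. */
        __raw_writeq(((u64) hi << 32) | lo, dest);
#else
        unsigned long flags;

        /* Two 32-bit stores; the lock keeps the halves of
         * concurrent doorbell rings from interleaving. */
        spin_lock_irqsave(doorbell_lock, flags);
        __raw_writel(hi, dest);
        __raw_writel(lo, dest + 4);
        spin_unlock_irqrestore(doorbell_lock, flags);
#endif
}

The SSE variant replaces the locked pair of stores with one 64-bit store through an XMM register, which is why the option is only offered for X86 && !X86_64 and why a kernel built with it needs a processor with SSE.)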
--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-11-23 08:10:19.146739186 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-11-23 08:10:19.197731667 -0800 @@ -0,0 +1,175 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. 
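+ * (The CQ and QP tables in mthca_dev.h, for instance, carry their own spinlocks for this purpose.)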
+ */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-11-23 08:10:19.234726213 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-23 08:10:19.274720315 -0800 @@ -0,0 +1,387 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; 
+ struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) \ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct 
mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-23 08:10:19.314714418 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
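struct mthca_dev embeds its struct ib_device by value, so to_mdev() above can recover the container with container_of() and no lookup table is needed. A self-contained illustration of the pattern (the stub types and macro name here are stand-ins, not driver code):

#include <stddef.h>

#define ex_container_of(ptr, type, member) \
        ((type *) ((char *) (ptr) - offsetof(type, member)))

struct ib_device_stub { int dummy; };

struct mthca_dev_stub {
        struct ib_device_stub ib_dev;   /* embedded, as in struct mthca_dev */
        int hca_type;
};

static struct mthca_dev_stub *to_mdev_stub(struct ib_device_stub *ibdev)
{
        /* subtract the member offset to get back to the container */
        return ex_container_of(ibdev, struct mthca_dev_stub, ib_dev);
}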
+ * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-23 08:10:19.352708816 -0800 @@ -0,0 +1,888 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
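The three mthca_write64() variants above exist because the HCA must observe a doorbell as a single 64-bit update; which variant gets compiled in is invisible to callers. A hypothetical ring of the send doorbell (the payload words are made up for illustration):

static void example_ring_send_doorbell(struct mthca_dev *dev, u32 nreq)
{
        u32 doorbell[2];

        doorbell[0] = cpu_to_be32(nreq);   /* illustrative payload only */
        doorbell[1] = cpu_to_be32(0);

        /* atomic on 64-bit or SSE builds, spinlock-protected otherwise */
        mthca_write64(doorbell, dev->kar + MTHCA_SEND_DOORBELL,
                      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
}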
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned 
status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x,
aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:14:58 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:58 -0800 Subject: [openib-general] [PATCH][RFC/v2][8/21] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <20041123814.y2QOtktHRf35o3M9@topspin.com> Message-ID: <20041123814.Yu9sv2vgFBLAV3pZ@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-11-23 08:10:19.734652499 -0800 @@ -0,0 +1,629 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
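The ID table above lists each chip twice, once per vendor ID (Mellanox- and Topspin-branded boards share silicon), with .driver_data carrying the HCA type so the probe path can branch on it. A minimal sketch of consuming that field (helper name hypothetical; the enum values come from mthca_dev.h):

static const char *example_hca_name(const struct pci_device_id *id)
{
        /* driver_data was filled in from the PCI ID table */
        switch (id->driver_data) {
        case TAVOR:        return "MT23108";
        case ARBEL_COMPAT: return "MT25208 (Tavor compat mode)";
        case ARBEL_NATIVE: return "MT25208";
        default:           return "unknown";
        }
}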
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
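	/*
	 * (Worked example of the block arithmetic used a few lines
	 * below: P_Keys are fetched 32 to a MAD, so attr_mod selects
	 * block index / 32 and the reply is indexed with index % 32;
	 * e.g. index 40 reads block 1, slot 8.)
	 */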
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(page_list); + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { +
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-11-23 08:10:19.785644981 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. 
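The three class attributes registered above surface as read-only sysfs files. A user-space sketch of reading one back (the path is an assumption based on the class device naming, not confirmed by this patch):

#include <stdio.h>

int main(void)
{
        char buf[64];
        FILE *f = fopen("/sys/class/infiniband/mthca0/hca_type", "r");

        if (f && fgets(buf, sizeof buf, f))
                printf("HCA type: %s", buf);
        if (f)
                fclose(f);
        return 0;
}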
All rights reserved. + * + * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibility to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized.
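 *
 * In code form, the completion-event rules above amount to the
 * following (simplified sketch, not the driver's exact functions):
 *
 *	spin_lock(&dev->cq_table.lock);
 *	cq = mthca_array_get(&dev->cq_table.cq, cqn);
 *	if (cq)
 *		atomic_inc(&cq->refcount);
 *	spin_unlock(&dev->cq_table.lock);
 *	... handle the event under cq->lock ...
 *	if (atomic_dec_and_test(&cq->refcount))
 *		wake_up(&cq->wait);
 *
 * while destroy drops the table's reference and then sleeps with
 * wait_event(cq->wait, !atomic_read(&cq->refcount)).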
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:07 -0800 Subject: [openib-general] [PATCH][RFC/v2][9/21] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <20041123814.Yu9sv2vgFBLAV3pZ@topspin.com> Message-ID: <20041123815.4PYKXCiYMYCttxq4@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-23 08:10:20.044606797 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
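Given the wqe_shift and queue-union layout above, locating entry n of a work queue is a shift plus, for indirectly allocated queues, a page lookup. A sketch of the addressing pattern the posting code relies on (function name hypothetical):

static void *example_get_recv_wqe(struct mthca_qp *qp, int n)
{
        if (qp->is_direct)
                return qp->queue.direct.buf + (n << qp->rq.wqe_shift);

        /* indirect: power-of-two-sized WQEs don't straddle pages */
        return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf +
               ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1));
}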
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
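+ *
+ * For reference, at HZ=1000 the disabled jiffy-based classes below
+ * would come to 2 jiffies (class A), 11 jiffies (class B) and 101
+ * jiffies (class C): e.g. (HZ + 9) / 10 + 1 = 1009 / 10 + 1 = 101.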
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
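+ *
+ * For example, mthca_MGID_HASH() below retrieves its 16-bit hash
+ * through this path:
+ *
+ *	u64 imm;
+ *	err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH,
+ *			    CMD_TIME_CLASS_A, status);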
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
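+ *
+ * For example, an entry with DMA address 0x230000 and length
+ * 0x10000 gives ffs(0x230000 | 0x10000) - 1 = 16, so it is passed
+ * to the FW in 64KB chunks; anything with lg < 12 (alignment below
+ * 4K) is rejected.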
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more signifant bits than minor + * version, so swap here. 
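+ *
+ * For example, a raw value of 0x000300020001 (major 3, subminor 2,
+ * minor 1) becomes 0x000300010002, which prints as FW version
+ * 3.1.2.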
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0;
+	flags |= param->vl_cap << INIT_IB_VL_SHIFT;
+	flags |= param->mtu_cap << INIT_IB_MTU_SHIFT;
+	MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET);
+
+	MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET);
+	MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET);
+	MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET);
+	MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET);
+	MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET);
+
+	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
+			CMD_TIME_CLASS_A, status);
+
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
+	return err;
+}
+
+int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status)
+{
+	return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status);
+}
+
+int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status);
+}
+
+int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry,
+		    int mpt_index, u8 *status)
+{
+	dma_addr_t indma;
+	int err;
+
+	indma = pci_map_single(dev->pdev, mpt_entry,
+			       MTHCA_MPT_ENTRY_SIZE,
+			       PCI_DMA_TODEVICE);
+	if (pci_dma_mapping_error(indma))
+		return -ENOMEM;
+
+	err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT,
+			CMD_TIME_CLASS_B, status);
+
+	pci_unmap_single(dev->pdev, indma,
+			 MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE);
+	return err;
+}
+
+int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry,
+		    int mpt_index, u8 *status)
+{
+	dma_addr_t outdma = 0;
+	int err;
+
+	if (mpt_entry) {
+		outdma = pci_map_single(dev->pdev, mpt_entry,
+					MTHCA_MPT_ENTRY_SIZE,
+					PCI_DMA_FROMDEVICE);
+		if (pci_dma_mapping_error(outdma))
+			return -ENOMEM;
+	}
+
+	err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry,
+			    CMD_HW2SW_MPT,
+			    CMD_TIME_CLASS_B, status);
+
+	if (mpt_entry)
+		pci_unmap_single(dev->pdev, outdma,
+				 MTHCA_MPT_ENTRY_SIZE,
+				 PCI_DMA_FROMDEVICE);
+	return err;
+}
+
+int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry,
+		    int num_mtt, u8 *status)
+{
+	dma_addr_t indma;
+	int err;
+
+	indma = pci_map_single(dev->pdev, mtt_entry,
+			       (num_mtt + 2) * 8,
+			       PCI_DMA_TODEVICE);
+	if (pci_dma_mapping_error(indma))
+		return -ENOMEM;
+
+	err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT,
+			CMD_TIME_CLASS_B, status);
+
+	pci_unmap_single(dev->pdev, indma,
+			 (num_mtt + 2) * 8, PCI_DMA_TODEVICE);
+	return err;
+}
+
+int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap,
+		 int eq_num, u8 *status)
+{
+	mthca_dbg(dev, "%s mask %016llx for eqn %d\n",
+		  unmap ?
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
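+	/* the u16 assignment above keeps only the low 16 bits of the
+	 * 64-bit immediate result returned by the MGID_HASH command */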
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-11-23 08:10:20.076602080 -0800 @@ -0,0 +1,260 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocaterd resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int 
max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void 
*cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) x, MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:14 -0800 Subject: [openib-general] [PATCH][RFC/v2][10/21] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <20041123815.4PYKXCiYMYCttxq4@topspin.com> Message-ID: <20041123815.Ai338wEt3YqtY107@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-23 08:10:20.359560358 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr,
+			      &dev->eq_table.eq[MTHCA_EQ_CMD]);
+	if (err)
+		goto err_out_async;
+
+	if (dev->mthca_flags & MTHCA_FLAG_MSI_X) {
+		static const char *eq_name[] = {
+			[MTHCA_EQ_COMP]  = DRV_NAME " (comp)",
+			[MTHCA_EQ_ASYNC] = DRV_NAME " (async)",
+			[MTHCA_EQ_CMD]   = DRV_NAME " (cmd)"
+		};
+
+		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
+			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
+					  mthca_msi_x_interrupt, 0,
+					  eq_name[i], dev->eq_table.eq + i);
+			if (err)
+				goto err_out_cmd;
+			dev->eq_table.eq[i].have_irq = 1;
+		}
+	} else {
+		err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ,
+				  DRV_NAME, dev);
+		if (err)
+			goto err_out_cmd;
+		dev->eq_table.have_irq = 1;
+	}
+
+	err = mthca_MAP_EQ(dev, async_mask(dev),
+			   0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status);
+	if (err)
+		mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n",
+			   dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err);
+	if (status)
+		mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n",
+			   dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status);
+
+	err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK,
+			   0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status);
+	if (err)
+		mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n",
+			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err);
+	if (status)
+		mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n",
+			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status);
+
+	return 0;
+
+err_out_cmd:
+	mthca_free_irqs(dev);
+	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]);
+
+err_out_async:
+	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]);
+
+err_out_comp:
+	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]);
+
+err_out_free:
+	mthca_alloc_cleanup(&dev->eq_table.alloc);
+	return err;
+}
+
+void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev)
+{
+	u8 status;
+	int i;
+
+	mthca_free_irqs(dev);
+
+	mthca_MAP_EQ(dev, async_mask(dev),
+		     1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status);
+	mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK,
+		     1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status);
+
+	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+		mthca_free_eq(dev, &dev->eq_table.eq[i]);
+
+	mthca_alloc_cleanup(&dev->eq_table.alloc);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */

From roland at topspin.com  Tue Nov 23 08:15:20 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 23 Nov 2004 08:15:20 -0800
Subject: [openib-general] [PATCH][RFC/v2][11/21] Add Mellanox HCA low-level driver (initialization)
In-Reply-To: <20041123815.Ai338wEt3YqtY107@topspin.com>
Message-ID: <20041123815.dUhm1PnERtccLLnp@topspin.com>

Add device initialization code for the Mellanox HCA driver.

Signed-off-by: Roland Dreier <roland@topspin.com>
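[Editor's note: for orientation, here is a rough sketch of the bring-up
sequence this patch implements. mthca_reset() and mthca_make_profile() are
the functions added below; the glue function name and the query/INIT_HCA
steps are assumptions based on the structures they fill, not code from the
patch.]

	/* Hypothetical glue, for illustration only -- not part of the patch. */
	static int mthca_bringup_sketch(struct mthca_dev *mdev)
	{
		struct mthca_dev_lim dev_lim;          /* device limits, filled by a firmware query (assumed) */
		struct mthca_init_hca_param init_hca;  /* parameters for the INIT_HCA command (assumed) */
		int err;

		err = mthca_reset(mdev);               /* mthca_reset.c, below */
		if (err)
			return err;

		/* ... query firmware and device limits into dev_lim (assumed) ... */

		err = mthca_make_profile(mdev, &dev_lim, &init_hca); /* mthca_profile.c, below */
		if (err)
			return err;

		/* ... hand init_hca to the INIT_HCA firmware command (assumed) ... */
		return 0;
	}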
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c	2004-11-23 08:10:20.600524828 -0800
@@ -0,0 +1,222 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+#include "mthca_profile.h"
+
+static int default_profile[MTHCA_RES_NUM] = {
+	[MTHCA_RES_QP]   = 1 << 16,
+	[MTHCA_RES_EQP]  = 1 << 16,
+	[MTHCA_RES_CQ]   = 1 << 16,
+	[MTHCA_RES_EQ]   = 32,
+	[MTHCA_RES_RDB]  = 1 << 18,
+	[MTHCA_RES_MCG]  = 1 << 13,
+	[MTHCA_RES_MPT]  = 1 << 17,
+	[MTHCA_RES_MTT]  = 1 << 20,
+	[MTHCA_RES_UDAV] = 1 << 15
+};
+
+enum {
+	MTHCA_RDB_ENTRY_SIZE = 32,
+	MTHCA_MTT_SEG_SIZE   = 64
+};
+
+enum {
+	MTHCA_NUM_PDS = 1 << 15
+};
+
+int mthca_make_profile(struct mthca_dev *dev,
+		       struct mthca_dev_lim *dev_lim,
+		       struct mthca_init_hca_param *init_hca)
+{
+	/* just use default profile for now */
+	struct mthca_resource {
+		u64 size;
+		u64 start;
+		int type;
+		int num;
+		int log_num;
+	};
+
+	u64 total_size = 0;
+	struct mthca_resource *profile;
+	struct mthca_resource tmp;
+	int i, j;
+
+	default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE;
+
+	profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL);
+	if (!profile)
+		return -ENOMEM;
+
+	profile[MTHCA_RES_QP].size   = dev_lim->qpc_entry_sz;
+	profile[MTHCA_RES_EEC].size  = dev_lim->eec_entry_sz;
+	profile[MTHCA_RES_SRQ].size  = dev_lim->srq_entry_sz;
+	profile[MTHCA_RES_CQ].size   = dev_lim->cqc_entry_sz;
+	profile[MTHCA_RES_EQP].size  = dev_lim->eqpc_entry_sz;
+	profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz;
+	profile[MTHCA_RES_EQ].size   = dev_lim->eqc_entry_sz;
+	profile[MTHCA_RES_RDB].size  = MTHCA_RDB_ENTRY_SIZE;
+	profile[MTHCA_RES_MCG].size  = MTHCA_MGM_ENTRY_SIZE;
+	profile[MTHCA_RES_MPT].size  = MTHCA_MPT_ENTRY_SIZE;
+	profile[MTHCA_RES_MTT].size  = MTHCA_MTT_SEG_SIZE;
+	profile[MTHCA_RES_UAR].size  = dev_lim->uar_scratch_entry_sz;
+	profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE;
+
+	for (i = 0; i < MTHCA_RES_NUM; ++i) {
+		profile[i].type     = i;
+		profile[i].num      = default_profile[i];
+		profile[i].log_num  = max(ffs(default_profile[i]) - 1, 0);
+		profile[i].size    *= default_profile[i];
+	}
+
+	/*
+	 * Sort the resources in decreasing order of size.  Since they
+	 * all have sizes that are powers of 2, we'll be able to keep
+	 * resources aligned to their size and pack them without gaps
+	 * using the sorted order.
+	 */
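[Editor's note: the alignment claim in the comment above can be checked with
a small standalone program. Because every size is a power of two and the list
is sorted in decreasing order, the running offset is always a multiple of the
next size to be placed. Illustration only, not part of the patch:]

	#include <assert.h>

	int main(void)
	{
		/* Power-of-two sizes in decreasing order, as after the sort below. */
		unsigned long sizes[] = { 1UL << 20, 1UL << 18, 1UL << 16, 1UL << 13 };
		unsigned long offset = 0;
		int i;

		for (i = 0; i < 4; ++i) {
			assert(offset % sizes[i] == 0);	/* each region is naturally aligned */
			offset += sizes[i];
		}
		return 0;
	}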
+	for (i = MTHCA_RES_NUM; i > 0; --i)
+		for (j = 1; j < i; ++j) {
+			if (profile[j].size > profile[j - 1].size) {
+				tmp            = profile[j];
+				profile[j]     = profile[j - 1];
+				profile[j - 1] = tmp;
+			}
+		}
+
+	for (i = 0; i < MTHCA_RES_NUM; ++i) {
+		if (profile[i].size) {
+			profile[i].start = dev->ddr_start + total_size;
+			total_size      += profile[i].size;
+		}
+		if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) {
+			mthca_err(dev, "Profile requires 0x%llx bytes; "
+				  "won't fit between DDR start at 0x%016llx "
+				  "and FW start at 0x%016llx.\n",
+				  (unsigned long long) total_size,
+				  (unsigned long long) dev->ddr_start,
+				  (unsigned long long) dev->fw.tavor.fw_start);
+			kfree(profile);
+			return -ENOMEM;
+		}
+
+		if (profile[i].size)
+			mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx "
+				  "(size 0x%8llx)\n",
+				  i, profile[i].type, profile[i].log_num,
+				  (unsigned long long) profile[i].start,
+				  (unsigned long long) profile[i].size);
+	}
+
+	mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n",
+		  (int) (total_size >> 10),
+		  (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10),
+		  (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10));
+
+	for (i = 0; i < MTHCA_RES_NUM; ++i) {
+		switch (profile[i].type) {
+		case MTHCA_RES_QP:
+			dev->limits.num_qps   = profile[i].num;
+			init_hca->qpc_base    = profile[i].start;
+			init_hca->log_num_qps = profile[i].log_num;
+			break;
+		case MTHCA_RES_EEC:
+			dev->limits.num_eecs   = profile[i].num;
+			init_hca->eec_base     = profile[i].start;
+			init_hca->log_num_eecs = profile[i].log_num;
+			break;
+		case MTHCA_RES_SRQ:
+			dev->limits.num_srqs   = profile[i].num;
+			init_hca->srqc_base    = profile[i].start;
+			init_hca->log_num_srqs = profile[i].log_num;
+			break;
+		case MTHCA_RES_CQ:
+			dev->limits.num_cqs   = profile[i].num;
+			init_hca->cqc_base    = profile[i].start;
+			init_hca->log_num_cqs = profile[i].log_num;
+			break;
+		case MTHCA_RES_EQP:
+			init_hca->eqpc_base = profile[i].start;
+			break;
+		case MTHCA_RES_EEEC:
+			init_hca->eeec_base = profile[i].start;
+			break;
+		case MTHCA_RES_EQ:
+			dev->limits.num_eqs   = profile[i].num;
+			init_hca->eqc_base    = profile[i].start;
+			init_hca->log_num_eqs = profile[i].log_num;
+			break;
+		case MTHCA_RES_RDB:
+			dev->limits.num_rdbs = profile[i].num;
+			init_hca->rdb_base   = profile[i].start;
+			break;
+		case MTHCA_RES_MCG:
+			dev->limits.num_mgms      = profile[i].num >> 1;
+			dev->limits.num_amgms     = profile[i].num >> 1;
+			init_hca->mc_base         = profile[i].start;
+			init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1;
+			init_hca->log_mc_table_sz = profile[i].log_num;
+			init_hca->mc_hash_sz      = 1 << (profile[i].log_num - 1);
+			break;
+		case MTHCA_RES_MPT:
+			dev->limits.num_mpts = profile[i].num;
+			init_hca->mpt_base   = profile[i].start;
+			init_hca->log_mpt_sz = profile[i].log_num;
+			break;
+		case MTHCA_RES_MTT:
+			dev->limits.num_mtt_segs = profile[i].num;
+			dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE;
+			dev->mr_table.mtt_base   = profile[i].start;
+			init_hca->mtt_base       = profile[i].start;
+			init_hca->mtt_seg_sz     = ffs(MTHCA_MTT_SEG_SIZE) - 7;
+			break;
+		case MTHCA_RES_UAR:
+			init_hca->uar_scratch_base = profile[i].start;
+			break;
+		case MTHCA_RES_UDAV:
+			dev->av_table.ddr_av_base = profile[i].start;
+			dev->av_table.num_ddr_avs = profile[i].num;
+			break;
+		default:
+			break;
+		}
+	}
+
+	/*
+	 * PDs don't take any HCA memory, but we assign them as part
+	 * of the HCA profile anyway.
+	 */
+	dev->limits.num_pds = MTHCA_NUM_PDS;
+
+	kfree(profile);
+	return 0;
+}
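[Editor's note: the log_num bookkeeping above leans on the counts being
powers of two, so that ffs(n) - 1 is exactly log2(n); the max(..., 0) clamp
handles a zero count, for which ffs() returns 0. A quick standalone check,
illustration only:]

	#include <assert.h>
	#include <strings.h>	/* ffs() */

	int main(void)
	{
		assert(ffs(1 << 16) - 1 == 16);	/* power of two: ffs(n) - 1 == log2(n) */
		assert(ffs(32) - 1 == 5);
		assert(ffs(0) == 0);		/* zero count clamps to log_num = 0 */
		return 0;
	}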
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h	2004-11-23 08:10:20.642518636 -0800
@@ -0,0 +1,58 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $
+ */
+
+#ifndef MTHCA_PROFILE_H
+#define MTHCA_PROFILE_H
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_RES_QP,
+	MTHCA_RES_EEC,
+	MTHCA_RES_SRQ,
+	MTHCA_RES_CQ,
+	MTHCA_RES_EQP,
+	MTHCA_RES_EEEC,
+	MTHCA_RES_EQ,
+	MTHCA_RES_RDB,
+	MTHCA_RES_MCG,
+	MTHCA_RES_MPT,
+	MTHCA_RES_MTT,
+	MTHCA_RES_UAR,
+	MTHCA_RES_UDAV,
+	MTHCA_RES_NUM
+};
+
+int mthca_make_profile(struct mthca_dev *mdev,
+		       struct mthca_dev_lim *dev_lim,
+		       struct mthca_init_hca_param *init_hca);
+
+#endif /* MTHCA_PROFILE_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c	2004-11-23 08:10:20.724506547 -0800
@@ -0,0 +1,228 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $
+ */
+
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+int mthca_reset(struct mthca_dev *mdev)
+{
+	int i;
+	int err = 0;
+	u32 *hca_header    = NULL;
+	u32 *bridge_header = NULL;
+	struct pci_dev *bridge = NULL;
+
+#define MTHCA_RESET_OFFSET 0xf0010
+#define MTHCA_RESET_VALUE  cpu_to_be32(1)
+
+	/*
+	 * Reset the chip.  This is somewhat ugly because we have to
+	 * save off the PCI header before reset and then restore it
+	 * after the chip reboots.
We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:25 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:25 -0800 Subject: [openib-general] [PATCH][RFC/v2][12/21] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <20041123815.dUhm1PnERtccLLnp@topspin.com> Message-ID: <20041123815.KMR5AMwRXU875N9Z@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-11-23 08:10:20.997466300 -0800 @@ -0,0 +1,821 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ *
+ * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $
+ */
+
+#include <linux/init.h>
+
+#include <ib_pack.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE
+};
+
+enum {
+	MTHCA_CQ_ENTRY_SIZE = 0x20
+};
+
+struct mthca_cq_context {
+	u32 flags;
+	u64 start;
+	u32 logsize_usrpage;
+	u32 error_eqn;
+	u32 comp_eqn;
+	u32 pd;
+	u32 lkey;
+	u32 last_notified_index;
+	u32 solicit_producer_index;
+	u32 consumer_index;
+	u32 producer_index;
+	u32 cqn;
+	u32 reserved[3];
+} __attribute__((packed));
+
+#define MTHCA_CQ_STATUS_OK          ( 0 << 28)
+#define MTHCA_CQ_STATUS_OVERFLOW    ( 9 << 28)
+#define MTHCA_CQ_STATUS_WRITE_FAIL  (10 << 28)
+#define MTHCA_CQ_FLAG_TR            ( 1 << 18)
+#define MTHCA_CQ_FLAG_OI            ( 1 << 17)
+#define MTHCA_CQ_STATE_DISARMED     ( 0 <<  8)
+#define MTHCA_CQ_STATE_ARMED        ( 1 <<  8)
+#define MTHCA_CQ_STATE_ARMED_SOL    ( 4 <<  8)
+#define MTHCA_EQ_STATE_FIRED        (10 <<  8)
+
+enum {
+	MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe
+};
+
+enum {
+	SYNDROME_LOCAL_LENGTH_ERR        = 0x01,
+	SYNDROME_LOCAL_QP_OP_ERR         = 0x02,
+	SYNDROME_LOCAL_EEC_OP_ERR        = 0x03,
+	SYNDROME_LOCAL_PROT_ERR          = 0x04,
+	SYNDROME_WR_FLUSH_ERR            = 0x05,
+	SYNDROME_MW_BIND_ERR             = 0x06,
+	SYNDROME_BAD_RESP_ERR            = 0x10,
+	SYNDROME_LOCAL_ACCESS_ERR        = 0x11,
+	SYNDROME_REMOTE_INVAL_REQ_ERR    = 0x12,
+	SYNDROME_REMOTE_ACCESS_ERR       = 0x13,
+	SYNDROME_REMOTE_OP_ERR           = 0x14,
+	SYNDROME_RETRY_EXC_ERR           = 0x15,
+	SYNDROME_RNR_RETRY_EXC_ERR       = 0x16,
+	SYNDROME_LOCAL_RDD_VIOL_ERR      = 0x20,
+	SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21,
+	SYNDROME_REMOTE_ABORTED_ERR      = 0x22,
+	SYNDROME_INVAL_EECN_ERR          = 0x23,
+	SYNDROME_INVAL_EEC_STATE_ERR     = 0x24
+};
+
+struct mthca_cqe {
+	u32 my_qpn;
+	u32 my_ee;
+	u32 rqpn;
+	u16 sl_g_mlpath;
+	u16 rlid;
+	u32 imm_etype_pkey_eec;
+	u32 byte_cnt;
+	u32 wqe;
+	u8  opcode;
+	u8  is_send;
+	u8  reserved;
+	u8  owner;
+} __attribute__((packed));
+
+struct mthca_err_cqe {
+	u32 my_qpn;
+	u32 reserved1[3];
+	u8  syndrome;
+	u8  reserved2;
+	u16 db_cnt;
+	u32 reserved3;
+	u32 wqe;
+	u8  opcode;
+	u8  reserved4[2];
+	u8  owner;
+} __attribute__((packed));
+
+#define MTHCA_CQ_ENTRY_OWNER_SW      (0 << 7)
+#define MTHCA_CQ_ENTRY_OWNER_HW      (1 << 7)
+
+#define MTHCA_CQ_DB_INC_CI       (1 << 24)
+#define MTHCA_CQ_DB_REQ_NOT      (2 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_SOL  (3 << 24)
+#define MTHCA_CQ_DB_SET_CI       (4 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24)
+
+static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry)
+{
+	if (cq->is_direct)
+		return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE);
+	else
+		return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf
+			+ (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE;
+}
+
+static inline int cqe_sw(struct mthca_cq *cq, int i)
+{
+	return !(MTHCA_CQ_ENTRY_OWNER_HW &
+		 get_cqe(cq, i)->owner);
+}
+
+static inline int next_cqe_sw(struct mthca_cq *cq)
+{
+	return cqe_sw(cq, cq->cons_index);
+}
+
+static inline void set_cqe_hw(struct mthca_cq *cq, int entry)
+{
+	get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW;
+}
+
+static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
+				  int nent)
+{
+	u32 doorbell[2];
+
+	doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn);
+	doorbell[1] = cpu_to_be32(nent - 1);
+
+	mthca_write64(doorbell,
+		      dev->kar + MTHCA_CQ_DOORBELL,
+		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+}
+
+void mthca_cq_event(struct mthca_dev *dev, u32 cqn)
+{
+	struct mthca_cq *cq;
+
+	spin_lock(&dev->cq_table.lock);
+	cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1));
+	if (cq)
+		atomic_inc(&cq->refcount);
+
spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs,
+			       (1 << 24) - 1,
+			       dev->limits.reserved_cqs);
+	if (err)
+		return err;
+
+	err = mthca_array_init(&dev->cq_table.cq,
+			       dev->limits.num_cqs);
+	if (err)
+		mthca_alloc_cleanup(&dev->cq_table.alloc);
+
+	return err;
+}
+
+void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev)
+{
+	mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs);
+	mthca_alloc_cleanup(&dev->cq_table.alloc);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c	2004-11-23 08:10:21.032461140 -0800
@@ -0,0 +1,1485 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_qp.c 1270 2004-11-18 21:47:31Z roland $
+ */
+
+#include <linux/init.h>
+
+#include <ib_verbs.h>
+#include <ib_cache.h>
+#include <ib_pack.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE,
+	MTHCA_ACK_REQ_FREQ       = 10,
+	MTHCA_FLIGHT_LIMIT       = 9,
+	MTHCA_UD_HEADER_SIZE     = 72 /* largest UD header possible */
+};
+
+enum {
+	MTHCA_QP_STATE_RST      = 0,
+	MTHCA_QP_STATE_INIT     = 1,
+	MTHCA_QP_STATE_RTR      = 2,
+	MTHCA_QP_STATE_RTS      = 3,
+	MTHCA_QP_STATE_SQE      = 4,
+	MTHCA_QP_STATE_SQD      = 5,
+	MTHCA_QP_STATE_ERR      = 6,
+	MTHCA_QP_STATE_DRAINING = 7
+};
+
+enum {
+	MTHCA_QP_ST_RC  = 0x0,
+	MTHCA_QP_ST_UC  = 0x1,
+	MTHCA_QP_ST_RD  = 0x2,
+	MTHCA_QP_ST_UD  = 0x3,
+	MTHCA_QP_ST_MLX = 0x7
+};
+
+enum {
+	MTHCA_QP_PM_MIGRATED = 0x3,
+	MTHCA_QP_PM_ARMED    = 0x0,
+	MTHCA_QP_PM_REARM    = 0x1
+};
+
+enum {
+	/* qp_context flags */
+	MTHCA_QP_BIT_DE  = 1 <<  8,
+	/* params1 */
+	MTHCA_QP_BIT_SRE = 1 << 15,
+	MTHCA_QP_BIT_SWE = 1 << 14,
+	MTHCA_QP_BIT_SAE = 1 << 13,
+	MTHCA_QP_BIT_SIC = 1 <<  4,
+	MTHCA_QP_BIT_SSC = 1 <<  3,
+	/* params2 */
+	MTHCA_QP_BIT_RRE = 1 << 15,
+	MTHCA_QP_BIT_RWE = 1 << 14,
+	MTHCA_QP_BIT_RAE = 1 << 13,
+	MTHCA_QP_BIT_RIC = 1 <<  4,
+	MTHCA_QP_BIT_RSC = 1 <<  3
+};
+
+struct mthca_qp_path {
+	u32 port_pkey;
+	u8  rnr_retry;
+	u8  g_mylmc;
+	u16 rlid;
+	u8  ackto;
+	u8  mgid_index;
+	u8  static_rate;
+	u8  hop_limit;
+	u32 sl_tclass_flowlabel;
+	u8  rgid[16];
+} __attribute__((packed));
+
+struct mthca_qp_context {
+	u32 flags;
+	u32 sched_queue;
+	u32 mtu_msgmax;
+	u32 usr_page;
+	u32 local_qpn;
+	u32 remote_qpn;
+	u32 reserved1[2];
+	struct mthca_qp_path pri_path;
+	struct mthca_qp_path alt_path;
+	u32 rdd;
+	u32 pd;
+	u32 wqe_base;
+	u32 wqe_lkey;
+	u32 params1;
+	u32 reserved2;
+	u32 next_send_psn;
+	u32 cqn_snd;
+	u32 next_snd_wqe[2];
+	u32 last_acked_psn;
+	u32 ssn;
+	u32 params2;
+	u32 rnr_nextrecvpsn;
+	u32 ra_buff_indx;
+	u32 cqn_rcv;
+	u32 next_rcv_wqe[2];
+	u32 qkey;
+	u32 srqn;
+	u32 rmsn;
+	u32 reserved3[19];
+} __attribute__((packed));
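[Editor's note: these context structures are consumed by the HCA in
big-endian layout, which is why mthca_modify_qp() below converts every field
with cpu_to_be32()/cpu_to_be16() before issuing the firmware command. A
minimal sketch of the convention follows; the helper function is
hypothetical, while cpu_to_be32(), MTHCA_KAR_PAGE, and the struct fields are
from the patch. Illustration only:]

	/* Not part of the patch -- sketch of the byte-order convention. */
	static void sketch_fill_context(struct mthca_qp_context *ctx, u32 qpn)
	{
		memset(ctx, 0, sizeof *ctx);		/* reserved fields must stay zero */
		ctx->local_qpn = cpu_to_be32(qpn);	/* hardware reads fields big-endian */
		ctx->usr_page  = cpu_to_be32(MTHCA_KAR_PAGE);
	}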
+ +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + 
qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + 
IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + 
mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = 
cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. 
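+ *
+ * (Editorial note: the wqe_shift values computed below are log2
+ * WQE strides, so receive WQE i ends up at offset i << rq.wqe_shift
+ * and send WQE i at send_wqe_offset + (i << sq.wqe_shift).)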
+ * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + 
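/*
+ * Note (editorial): the reference count starts at 1 on behalf
+ * of the creator. The async event handler above takes a
+ * reference for the duration of each event dispatch, and
+ * mthca_free_qp() drops the initial reference and sleeps on
+ * qp->wait until the count reaches zero.
+ */
+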
atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + 
pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
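+ /*
+ * (Editorial note: a remote Q_Key with the high bit set means
+ * "use the QP's own Q_Key" -- the usual IB convention for the
+ * well-known Q_Key -- which is what this test implements.)
+ */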
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
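+ /*
+ * (Editorial note: size0 is still zero only while linking in
+ * the first WQE of a batch, so only that link gets the DBD
+ * bit; the batch as a whole is announced by the doorbell
+ * below, which carries size0 and op0.)
+ */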
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:38 -0800 Subject: [openib-general] [PATCH][RFC/v2][13/21] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <20041123815.KMR5AMwRXU875N9Z@topspin.com> Message-ID: <20041123815.NWFV7rNrbnpqbYAH@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-23 08:10:21.345414995 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if 
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-23 08:10:21.371411162 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
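+ *
+ * (Editorial sketch: MGM entries indexed by the MGID hash head
+ * the chains, overflow entries live in the AMGM, and chains are
+ * linked through next_gid_index, which stores the next entry's
+ * index shifted left by 5 bits.)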
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
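/*
+ * Outline (editorial): look up the group, clear this QP's slot,
+ * compact the remaining QPs in the entry, and if the group is
+ * now empty unlink it from its hash chain -- either by pulling
+ * the next AMGM entry into the chain head or by zeroing the
+ * head's MGID.
+ */
+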
struct mthca_dev *dev = to_mdev(ibqp->device);
+ void *mailbox;
+ struct mthca_mgm *mgm;
+ u16 hash;
+ int prev, index;
+ int i, loc;
+ int err;
+ u8 status;
+
+ mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL);
+ if (!mailbox)
+ return -ENOMEM;
+ mgm = MAILBOX_ALIGN(mailbox);
+
+ if (down_interruptible(&dev->mcg_table.sem))
+ return -EINTR;
+
+ err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index);
+ if (err)
+ goto out;
+
+ if (index == -1) {
+ mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x "
+ "not found\n",
+ be16_to_cpu(((u16 *) gid->raw)[0]),
+ be16_to_cpu(((u16 *) gid->raw)[1]),
+ be16_to_cpu(((u16 *) gid->raw)[2]),
+ be16_to_cpu(((u16 *) gid->raw)[3]),
+ be16_to_cpu(((u16 *) gid->raw)[4]),
+ be16_to_cpu(((u16 *) gid->raw)[5]),
+ be16_to_cpu(((u16 *) gid->raw)[6]),
+ be16_to_cpu(((u16 *) gid->raw)[7]));
+ err = -EINVAL;
+ goto out;
+ }
+
+ for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) {
+ if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31)))
+ loc = i;
+ if (!(mgm->qp[i] & cpu_to_be32(1 << 31)))
+ break;
+ }
+
+ if (loc == -1) {
+ mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num);
+ err = -EINVAL;
+ goto out;
+ }
+
+ mgm->qp[loc] = mgm->qp[i - 1];
+ mgm->qp[i - 1] = 0;
+
+ err = mthca_WRITE_MGM(dev, index, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (i != 1)
+ goto out;
+
+ if (prev == -1) {
+ /* Remove entry from MGM */
+ if (be32_to_cpu(mgm->next_gid_index) >> 5) {
+ err = mthca_READ_MGM(dev,
+ be32_to_cpu(mgm->next_gid_index) >> 5,
+ mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "READ_MGM returned status %02x\n",
+ status);
+ err = -EINVAL;
+ goto out;
+ }
+ } else
+ memset(mgm->gid, 0, 16);
+
+ err = mthca_WRITE_MGM(dev, index, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ /* Remove entry from AMGM */
+ index = be32_to_cpu(mgm->next_gid_index) >> 5;
+ err = mthca_READ_MGM(dev, prev, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "READ_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+
+ mgm->next_gid_index = cpu_to_be32(index << 5);
+
+ err = mthca_WRITE_MGM(dev, prev, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ out:
+ up(&dev->mcg_table.sem);
+ kfree(mailbox);
+ return err;
+}
+
+int __devinit mthca_init_mcg_table(struct mthca_dev *dev)
+{
+ int err;
+
+ err = mthca_alloc_init(&dev->mcg_table.alloc,
+ dev->limits.num_amgms,
+ dev->limits.num_amgms - 1,
+ 0);
+ if (err)
+ return err;
+
+ init_MUTEX(&dev->mcg_table.sem);
+
+ return 0;
+}
+
+void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev)
+{
+ mthca_alloc_cleanup(&dev->mcg_table.alloc);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-23 08:10:21.410405413 -0800
@@ -0,0 +1,389 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * , or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.
These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", 
err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + 
mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-23 08:10:21.436401580 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:46 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:46 -0800 Subject: [openib-general] [PATCH][RFC/v2][14/21] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <20041123815.NWFV7rNrbnpqbYAH@topspin.com> Message-ID: <20041123815.Irsm0l3oz7MStqls@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-23 08:10:21.738357057 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
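+ *
+ * (Hence sm_lock below only needs to cover posting the send,
+ * not the lifetime of the MAD on the wire.)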
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
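+ /* q == 1 registers the GSI agent (QP1), q == 0 the SMI agent (QP0): */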
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:52 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:52 -0800 Subject: [openib-general] [PATCH][RFC/v2][15/21] IPoIB IPv4 multicast In-Reply-To: <20041123815.Irsm0l3oz7MStqls@topspin.com> Message-ID: <20041123815.3UphmLcWp4RG6D85@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-11-23 08:10:22.004317841 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-11-23 08:09:44.620829918 -0800 +++ linux-bk/include/net/ip.h 2004-11-23 08:10:22.005317694 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. 
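+ *
+ * Layout of the resulting 20-byte hardware address: byte 0 is
+ * reserved, bytes 1-3 carry the all-ones multicast QPN, and
+ * bytes 4-19 hold the multicast GID (link-local scope, IPv4
+ * signature, P_Key, then the low 28 bits of the group address).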
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-11-23 08:09:54.024443395 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-23 08:10:22.005317694 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Tue Nov 23 08:15:57 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:57 -0800 Subject: [openib-general] [PATCH][RFC/v2][16/21] IPoIB IPv6 support In-Reply-To: <20041123815.3UphmLcWp4RG6D85@topspin.com> Message-ID: <20041123815.OuqXEOqXJtDtY180@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-11-23 08:09:55.180272973 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-23 08:10:22.300274203 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-23 08:09:54.776332532 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-23 08:10:22.302273908 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1098,6 +1099,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-23 08:09:38.159782567 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-23 08:10:22.302273908 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Tue Nov 23 08:16:03 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:03 -0800 Subject: [openib-general] [PATCH][RFC/v2][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041123815.OuqXEOqXJtDtY180@topspin.com> Message-ID: <20041123816.7BdwvFRYhI45pb9i@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-23 08:10:19.036755403 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:22.620227027 -0800 @@ -10,4 +10,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-23 08:10:18.998761005 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-23 08:10:22.583232481 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-11-23 08:10:22.719212431 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. 
The output can be turned on by setting + the debug_level parameter to 2; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-11-23 08:10:22.683217739 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-11-23 08:10:22.764205797 -0800 @@ -0,0 +1,314 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib.h 1275 2004-11-22 23:04:04Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +#include "ipoib_proto.h" + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_TX_FULL = 0, + IPOIB_FLAG_OPER_UP = 1, + IPOIB_FLAG_ADMIN_UP = 2, + IPOIB_PKEY_ASSIGNED = 3, + IPOIB_PKEY_STOP = 4, + IPOIB_FLAG_SUBINTERFACE = 5, + IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned 
int mcast_mtu; + + struct ipoib_buf *rx_ring; + + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct net_device *dev; + struct neighbour *neighbour; +}; + +static inline struct ipoib_path **to_ipoib_path(struct neighbour *neigh) +{ + return (struct ipoib_path **) (neigh->ha + 24); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void 
ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; +extern int mcast_debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (debug_level > 1) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-11-23 08:10:22.816198131 -0800 @@ -0,0 +1,276 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return 
__ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-11-23 08:10:22.857192086 -0800 @@ -0,0 +1,626 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_ib.c 1267 2004-11-18 20:31:22Z roland $ + */ + +#include + +#include + +#include "ipoib.h" + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, 
IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + if (!(skb = skb_unshare(skb, GFP_ATOMIC))) { + ipoib_warn(priv, "failed to unshare sk_buff. Dropping\n"); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). 
That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len)) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + } else { + unsigned long flags; + + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + spin_lock_irqsave(&priv->lock, flags); + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. 
+ */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. 
+ */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-11-23 08:10:22.898186042 -0800 @@ -0,0 +1,954 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_main.c 1273 2004-11-22 22:59:30Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define DATA_PATH_DEBUG_HELP " and data path tracing if > 1" +#else +#define DATA_PATH_DEBUG_HELP "" +#endif + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" DATA_PATH_DEBUG_HELP); + +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_device_handle(struct net_device *dev, struct ib_device **ca, + u8 *port_num, union ib_gid *gid, u16 *pkey) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + *ca = priv->ca; + *port_num = priv->port; + *gid = priv->local_gid; + *pkey = priv->pkey; + + return 0; +} +EXPORT_SYMBOL(ipoib_device_handle); + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct ipoib_dev_priv *priv = netdev_priv(path->dev); + struct sk_buff *skb; + 
struct ipoib_ah *ah; + + ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", + status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + + if (status != IB_WC_SUCCESS) + goto err; + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(path->dev, priv->pd, &av); + } + + if (!ah) + goto err; + + path->ah = ah; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, pathrec->dlid, pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) { + skb->dev = path->dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } + + return; + +err: + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb(skb); + + if (path->neighbour) + *to_ipoib_path(path->neighbour) = NULL; + + kfree(path); +} + +static int path_rec_start(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); + struct ib_sa_path_rec rec = { + .numb_path = 1 + }; + struct ib_sa_query *query; + + if (!path) + goto err; + + path->ah = NULL; + path->dev = dev; + skb_queue_head_init(&path->queue); + __skb_queue_tail(&path->queue, skb); + path->neighbour = NULL; + + rec.sgid = priv->local_gid; + memcpy(rec.dgid.raw, skb->dst->neighbour->ha + 4, 16); + rec.pkey = cpu_to_be16(priv->pkey); + + /* + * XXX there's a race here if path record completion runs + * before we get to finish up. Add a lock to path struct? + */ + if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &query) < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + goto err; + } + + path->neighbour = skb->dst->neighbour; + *to_ipoib_path(skb->dst->neighbour) = path; + return 0; + +err: + kfree(path); + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + return 0; +} + +static int path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) + return path_rec_start(skb, dev); + + /* Add in the P_Key */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4), + skb); + return 0; +} + +static void unicast_arp_completion(int status, + struct ib_sa_path_rec *pathrec, + void *skb_ptr) +{ + struct sk_buff *skb = skb_ptr; + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + struct ipoib_ah *ah; + + ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", + status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + + if (status) + goto err; + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(skb->dev, priv->pd, &av); + } + + if (!ah) + goto err; + + *(struct ipoib_ah **) skb->cb = ah; + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue ARP packet\n"); + + return; + +err: + dev_kfree_skb(skb); +} + +static void unicast_arp_finish(struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); 
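+ /* unicast_arp_completion() stashed the AH pointer in skb->cb */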
+ struct ipoib_ah *ah = *(struct ipoib_ah **) skb->cb; + unsigned long flags; + + if (ah) { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +/* + * For unicast packets with no skb->dst->neighbour (unicast ARPs are + * the main example), we fire off a path record query for each packet. + * This is pretty bad for scalability (since this is going to hammer + * the SM on a big fabric) but it's the best I can think of for now. + * + * Also we might have a problem if a path changes, because ARPs will + * still go through (since we'll get the new path from the SM for + * these queries) so we'll never update the neighbour. + */ +static int unicast_arp_start(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *tmp_skb; + struct ib_sa_path_rec rec = { + .numb_path = 1 + }; + struct ib_sa_query *query; + + if (skb->destructor) { + tmp_skb = skb; + skb = skb_clone(tmp_skb, GFP_ATOMIC); + dev_kfree_skb_any(tmp_skb); + if (!skb) { + ++priv->stats.tx_dropped; + return 0; + } + } + + skb->dev = dev; + skb->destructor = unicast_arp_finish; + memset(skb->cb, 0, sizeof skb->cb); + + rec.sgid = priv->local_gid; + memcpy(rec.dgid.raw, phdr->hwaddr + 4, 16); + rec.pkey = cpu_to_be16(priv->pkey); + + /* + * XXX We need to keep a record of the skb and TID somewhere + * so that we can cancel the request if the device goes down + * before it finishes. + */ + if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + unicast_arp_completion, + skb, &query) < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + return 0; +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) + return path_lookup(skb, dev); + + path = *to_ipoib_path(skb->dst->neighbour); + + if (likely(path->ah)) { + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + return 0; + } + + if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&path->queue, skb); + else + goto err; + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key */ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } + else { + /* unicast GID -- ARP reply?? */ + + /* + * If destructor is unicast_arp_finish, we've + * already been through the path lookup and + * now we can just send the packet. + */ + if (skb->destructor == unicast_arp_finish) { + ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, + be32_to_cpup((u32 *) phdr->hwaddr)); + return 0; + } + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? 
"neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + return 0; + } + + /* put the pseudoheader back on */ + skb_push(skb, sizeof *phdr); + return unicast_arp_start(skb, dev, phdr); + } + } + + return 0; + +err: + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + return 0; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. + */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *neigh) +{ + struct ipoib_path *path = *to_ipoib_path(neigh); + + ipoib_dbg(netdev_priv(neigh->dev), + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) neigh->ha), + IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); + + if (path && path->ah) { + ipoib_put_ah(path->ah); + kfree(path); + } +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. 
+ */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. 
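+ * (That is IPOIB_ENCAP_LEN = 4 bytes of encapsulation header
+ * plus the INFINIBAND_ALEN = 20 byte hardware address.)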
+ */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_wq: + destroy_workqueue(ipoib_workqueue); + +err_fs: + ipoib_unregister_debugfs(); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-11-23 08:10:22.940179850 -0800 @@ -0,0 +1,928 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * + * $Id: ipoib_multicast.c 1277 2004-11-23 01:08:07Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if 
(test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + 
mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if 
(!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, 
be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. + */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. 
+ */ + mcast = NULL; + } + +out: + spin_unlock_irqrestore(&priv->lock, flags); + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_path(skb->dst->neighbour)) { + struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); + + if (path) { + kref_get(&mcast->ah->ref); + path->ah = mcast->ah; + path->dev = dev; + path->neighbour = skb->dst->neighbour; + *to_ipoib_path(skb->dst->neighbour) = path; + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. 
We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark the entries that are found, and create the ones that don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send-only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries that don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union
ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_proto.h 2004-11-23 08:10:22.978174248 -0800 @@ -0,0 +1,37 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib_proto.h 1254 2004-11-17 17:19:12Z roland $ + */ + +#ifndef _IPOIB_PROTO_H +#define _IPOIB_PROTO_H + +#include +#include + +/* + * Public functions + */ + +int ipoib_device_handle(struct net_device *dev, struct ib_device **ca, + u8 *port_num, union ib_gid *gid, u16 *pkey); + +#endif /* _IPOIB_PROTO_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-11-23 08:10:23.018168351 -0800 @@ -0,0 +1,248 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_verbs.c 1262 2004-11-18 17:38:36Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in an INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq; + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_destroy_qp failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_destroy_cq failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE) {
ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-11-23 08:10:23.043164665 -0800 @@ -0,0 +1,166 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib_vlan.c 1271 2004-11-18 22:11:29Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. 
+ */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Tue Nov 23 08:16:09 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:09 -0800 Subject: [openib-general] [PATCH][RFC/v2][18/21] Add InfiniBand userspace MAD support In-Reply-To: <20041123816.7BdwvFRYhI45pb9i@topspin.com> Message-ID: <20041123816.bPLXoHbNS6amekEO@topspin.com> Add a driver that provides a character special device for each InfiniBand port. This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. 
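For illustration only, a minimal userspace client of this interface could be sketched as below. The sketch is not part of the patch: the /dev/umad0 path is an assumption (node creation depends on the local udev/devfs setup), error handling is reduced to bare exits, and it uses only the ioctls and structures from the ib_user_mad.h header added later in this message.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include "ib_user_mad.h"	/* copied from this patch series */

int main(void)
{
	struct ib_user_mad_reg_req req;
	struct ib_user_mad mad;
	__u32 abi;
	int fd;

	fd = open("/dev/umad0", O_RDWR);	/* hypothetical device path */
	if (fd < 0)
		return 1;

	/* Check that kernel and userspace agree on the ABI version */
	if (ioctl(fd, IB_USER_MAD_GET_ABI_VERSION, &abi) < 0 ||
	    abi != IB_USER_MAD_ABI_VERSION)
		return 1;

	/*
	 * Register an agent on the GSI QP (qpn must be 0 or 1).
	 * mgmt_class stays 0 since we don't want unsolicited MADs;
	 * the kernel fills in req.id on success.
	 */
	memset(&req, 0, sizeof req);
	req.qpn = 1;
	if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, &req) < 0)
		return 1;

	/*
	 * read() blocks until a MAD for one of our agents arrives,
	 * e.g. the response to a MAD we sent earlier with write().
	 */
	if (read(fd, &mad, sizeof mad) == sizeof mad)
		printf("agent %u: MAD from LID 0x%04x, status %u\n",
		       (unsigned) mad.id, ntohs(mad.lid),
		       (unsigned) mad.status);

	ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &req.id);
	close(fd);
	return 0;
}

Sends work the same way in reverse: fill in a struct ib_user_mad (agent id, destination LID/QPN/Q_Key and the MAD payload in data[]) and write() it to the same file descriptor.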
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-23 08:10:18.652812015 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:23.631077978 -0800 @@ -1,22 +1,12 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += \ - ib_core.o \ - ib_mad.o \ - ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o -ib_core-objs := \ - packer.o \ - ud_header.o \ - verbs.o \ - sysfs.o \ - device.o \ - fmr_pool.o \ - cache.o +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o -ib_mad-objs := \ - mad.o \ - smi.o \ - agent.o +ib_mad-y := mad.o smi.o agent.o -ib_sa-objs := sa_query.o +ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-11-23 08:10:23.697068248 -0800 @@ -0,0 +1,649 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device *class_dev; + struct ib_device *ib_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static struct class_simple *umad_class; + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct 
ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +{ + 
struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_GET_ABI_VERSION: + return put_user(IB_USER_MAD_ABI_VERSION, + (u32 __user *) arg) ? 
-EFAULT : 0; + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + memset(&umad_dev->port[i - s].dev, 0, sizeof (struct cdev)); + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev = + class_simple_device_add(umad_class, + umad_dev->port[i - s].dev.dev, + device->dma_device, + "umad%d", umad_dev->port[i - s].devnum); + if (IS_ERR(umad_dev->port[i - s].class_dev)) + goto err_class; + + class_set_devdata(umad_dev->port[i - s].class_dev, + &umad_dev->port[i - s]); + + if (class_device_create_file(umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(umad_dev->port[i - 
s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) { + class_simple_device_remove(umad_dev->port[i - s].dev.dev); + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + } + + kfree(umad_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { + class_simple_device_remove(umad_dev->port[i].dev.dev); + cdev_del(&umad_dev->port[i].dev); + clear_bit(umad_dev->port[i].devnum, dev_map); + } + + kfree(umad_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + umad_class = class_simple_create(THIS_MODULE, "infiniband_mad"); + if (IS_ERR(umad_class)) { + printk(KERN_ERR "user_mad: couldn't create class_simple\n"); + ret = PTR_ERR(umad_class); + goto out_chrdev; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + return 0; + +out_class: + class_simple_destroy(umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + ib_unregister_client(&umad_client); + class_simple_destroy(umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-11-23 08:10:23.724064267 -0800 @@ -0,0 +1,111 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * <http://openib.org/license.html>. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include <linux/types.h> +#include <linux/ioctl.h> + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be received + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; }; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_GET_ABI_VERSION _IOR(IB_IOCTL_MAGIC, 0, __u32) + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) + +#endif /* IB_USER_MAD_H */ From roland at topspin.com Tue Nov 23 08:16:15 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:15 -0800 Subject: [openib-general] [PATCH][RFC/v2][19/21] Document InfiniBand ioctl use In-Reply-To: <20041123816.bPLXoHbNS6amekEO@topspin.com> Message-ID: <20041123816.baaAyOggjbry3R4e@topspin.com> Add the 0x1b ioctl magic number used by the ib_umad module to Documentation/ioctl-number.txt.
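For anyone wiring this up from userspace: the request codes in ib_user_mad.h are built from this magic number, so an ABI check ends up looking roughly like the sketch below (illustration only; the device path assumes the udev rule suggested in the documentation patch, and error handling is omitted):

	/* sketch, not part of this patch */
	int fd = open("/dev/infiniband/mthca0/ports/1/mad", O_RDWR);
	__u32 abi;

	if (fd >= 0 && ioctl(fd, IB_USER_MAD_GET_ABI_VERSION, &abi) == 0)
		printf("umad ABI version %u\n", abi);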
Signed-off-by: Roland Dreier --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-23 08:09:54.932309534 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-23 08:10:24.016021218 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Tue Nov 23 08:16:20 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:20 -0800 Subject: [openib-general] [PATCH][RFC/v2][20/21] Add InfiniBand Documentation files In-Reply-To: <20041123816.baaAyOggjbry3R4e@topspin.com> Message-ID: <20041123816.Z3lNI0kVfxRLOphJ@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-23 08:10:24.271983477 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<intf name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance.
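+
+  For example, assuming the parameter files appear directly under
+  /sys/module/ib_ipoib/ as described above, tracing can be enabled
+  at runtime with:
+
+    echo 1 > /sys/module/ib_ipoib/debug_level
+    echo 1 > /sys/module/ib_ipoib/mcast_debug_level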
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-23 08:10:24.316976843 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-23 08:10:24.365969619 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. 
For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Tue Nov 23 08:16:27 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:27 -0800 Subject: [openib-general] [PATCH][RFC/v2][21/21] InfiniBand MAINTAINERS entry In-Reply-To: <20041123816.Z3lNI0kVfxRLOphJ@topspin.com> Message-ID: <20041123816.kKEP5asEjoRbLoxS@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-11-23 08:09:38.208775343 -0800 +++ linux-bk/MAINTAINERS 2004-11-23 08:10:24.658926423 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From robert.j.woodruff at intel.com Tue Nov 23 08:25:59 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 23 Nov 2004 08:25:59 -0800 Subject: [openib-general] troubles with IPoIB Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002DCBAE3@orsmsx408> >What is the firmware version of the PCIe adapters ? I have seen problems >like this when not all the adapters were at 4.5.3. I also saw a problem with multicast packets on PCI-E adapters, with the SF ipoib. There was a problem with the 4.5.3 firmware that seemed to be fixed with the 4.6.0-rc4 firmware. Not sure if that one is released yet, but you might want to check with Mellanox. From iod00d at hp.com Tue Nov 23 08:43:59 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 08:43:59 -0800 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101186939.29554.92.camel@trinity> References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> <1101186939.29554.92.camel@trinity> Message-ID: <20041123164359.GB10431@esmail.cup.hp.com> On Mon, Nov 22, 2004 at 09:15:38PM -0800, Matt Leininger wrote: > We are using fw_ver 4.5.0. Looks like we need to upgrade. Time to > try the user space firmware burning tools. FWIW, tvflash works fine under 2.6.10-rc1 kernels on ia64.
I hacked the code a bit so it's more informative about what's going on...I guess I should submit a diff back to Roland. thanks, grant From greg at kroah.com Tue Nov 23 09:22:56 2004 From: greg at kroah.com (Greg KH) Date: Tue, 23 Nov 2004 09:22:56 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Message-ID: <20041123172256.GA30264@kroah.com> On Tue, Nov 23, 2004 at 08:14:19AM -0800, Roland Dreier wrote: > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/core/cache.c 2004-11-23 08:10:16.816082837 -0800 > @@ -0,0 +1,338 @@ > +/* > + This software is available to you under a choice of one of two > + licenses. You may choose to be licensed under the terms of the GNU > + General Public License (GPL) Version 2, available at > + <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD > + license, available in the LICENSE.TXT file accompanying this > + software. These details are also available at > + <http://openib.org/license.html>. Sorry, but this is the wrong license for this file still. Come on, you can't tell me that your lawyers didn't vet this code at least once before submission... Looks like the openib group is going to have to give up on their dream of keeping a BSD license for their code, sorry. > +/* > + Local Variables: > + c-file-style: "linux" > + indent-tabs-mode: t > + End: > +*/ Are these really necessary in every file? Just set these to be your editor's defaults. thanks, greg k-h From roland at topspin.com Tue Nov 23 09:34:53 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 09:34:53 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123172256.GA30264@kroah.com> (Greg KH's message of "Tue, 23 Nov 2004 09:22:56 -0800") References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> <20041123172256.GA30264@kroah.com> Message-ID: <521xek79ky.fsf@topspin.com> Greg> Are these really necessary in every file? Just set these to Greg> be your editor's defaults. I'll strip them out before next time... - R. From halr at voltaire.com Tue Nov 23 09:58:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 23 Nov 2004 12:58:17 -0500 Subject: [openib-general] troubles with IPoIB In-Reply-To: <52llcs7ey0.fsf@topspin.com> References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> <1101186939.29554.92.camel@trinity> <52llcs7ey0.fsf@topspin.com> Message-ID: <1101232697.19855.2.camel@localhost.localdomain> On Tue, 2004-11-23 at 10:39, Roland Dreier wrote: > Matt> We are using fw_ver 4.5.0. Looks like we need to upgrade. > Matt> Time to try the user space firmware burning tools. > > I would recommend _not_ using tvflash to upgrade PCIe HCAs from FW > 4.5.0 to 4.5.3 right now. The invariant sector of flash needs to be > rewritten, and the version of tvflash checked in right now doesn't > handle that properly yet. Give me a day or so to fix it... The HCAs can still be updated using the Mellanox tools (new mstflint ? or other OS/driver with InfiniBurn and/or old mstflint) in the meantime or one can wait.
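(From memory, the mstflint invocation is something along the lines of "mstflint -d <PCI device> -i <image>.bin burn" -- but check the tool's own usage output, I may be misremembering the exact switches.)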
-- Hal From halr at voltaire.com Tue Nov 23 10:20:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 23 Nov 2004 13:20:21 -0500 Subject: [openib-general] Start of an IPoIB FAQ Message-ID: <1101234021.19855.16.camel@localhost.localdomain> Hi, The start of an IPoIB FAQ may be in order. Something along the lines of: ping doesn't work between IPoIB nodes. What should I do ? First, verify that the ports are active. This can be done via: cat /sys/class/infiniband/mthca0/ports/1/state This should indicate 4: ACTIVE assuming the HCA is mthca0 and port 1 is plugged in. Next, verify the firmware version via cat /sys/class/infiniband/mthca0/fw_ver For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version 4.5.3 is recommended. If these versions of the firmware are being used, indicate the configuration and which SM is being utilized. Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" show anything on the other nodes when you try to ping or something? There are 2 levels of IPoIB debug which can be enabled when building: IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. The latter has performance implications and should only be enabled when all else fails. Enable the first level of IPoIB debug and then: mount -t ipoib_debugfs none /ipoib_debugfs/ cat /ipoib_debugfs/ib0_mcg -- Hal From roland at topspin.com Tue Nov 23 10:40:50 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 10:40:50 -0800 Subject: [openib-general] [PATCH] Convert from pci_xxx to dma_xxx functions In-Reply-To: <52wtwd8cji.fsf@topspin.com> (Roland Dreier's message of "Mon, 22 Nov 2004 19:33:21 -0800") References: <52wtwd8cji.fsf@topspin.com> Message-ID: <52y8gs4de5.fsf@topspin.com> I heard no objections, so I'm going to go ahead and commit this. - R. From iod00d at hp.com Tue Nov 23 11:32:25 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 11:32:25 -0800 Subject: [openib-general] [PATCH] Convert from pci_xxx to dma_xxx functions In-Reply-To: <52wtwd8cji.fsf@topspin.com> References: <52wtwd8cji.fsf@topspin.com> Message-ID: <20041123193225.GC10431@esmail.cup.hp.com> On Mon, Nov 22, 2004 at 07:33:21PM -0800, Roland Dreier wrote: > Christoph Hellwig suggested we might as well put a generic struct > device *dma_device and use the generic dma_map functions rather than > assuming we're dealing with a PCI device. (There's no dma_xxx > equivalent of pci_unmap_addr_set() and friends, so I left that stuff-- > Christoph agrees this is OK for now). > > Look OK to commit? yeah - looks fine to me. thanks, grant From sam at ravnborg.org Tue Nov 23 11:53:45 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Tue, 23 Nov 2004 20:53:45 +0100 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041123814.rXLIXw020elfd6Da@topspin.com> References: <20041123814.p0AnYzTlx42JeVes@topspin.com> <20041123814.rXLIXw020elfd6Da@topspin.com> Message-ID: <20041123195345.GC8367@mars.ravnborg.org> On Tue, Nov 23, 2004 at 08:14:14AM -0800, Roland Dreier wrote: > Add public headers for core InfiniBand support. This can be thought > of as a midlayer that provides an abstraction between low-level > hardware drivers and upper level protocols (such as > IP-over-InfiniBand). > > Signed-off-by: Roland Dreier After giving it a second thought my vote goes for: include/linux/infiniband And just a few comments to the API towards drivers...
Sam From sam at ravnborg.org Tue Nov 23 11:56:54 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Tue, 23 Nov 2004 20:56:54 +0100 Subject: [openib-general] Re: [PATCH][RFC/v2][4/21] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <20041123814.xOcI2C4YpT1G9jQi@topspin.com> References: <20041123814.LeHMD5hRZLn6VbLm@topspin.com> <20041123814.xOcI2C4YpT1G9jQi@topspin.com> Message-ID: <20041123195654.GD8367@mars.ravnborg.org> On Tue, Nov 23, 2004 at 08:14:31AM -0800, Roland Dreier wrote: > + > +struct ib_grh { > + u32 version_tclass_flow; > + u16 paylen; > + u8 next_hdr; > + u8 hop_limit; > + union ib_gid sgid; > + union ib_gid dgid; > +} __attribute__ ((packed)); It was explained on lkml why these structs were packed. The same info belongs here as a comment so it is known next time. And I see comments on the API here - good. Sam From iod00d at hp.com Tue Nov 23 12:28:38 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 12:28:38 -0800 Subject: [openib-general] Start of an IPoIB FAQ In-Reply-To: <1101234021.19855.16.camel@localhost.localdomain> References: <1101234021.19855.16.camel@localhost.localdomain> Message-ID: <20041123202838.GI10431@esmail.cup.hp.com> On Tue, Nov 23, 2004 at 01:20:21PM -0500, Hal Rosenstock wrote: > Hi, > > The start of an IPoIB FAQ may be in order. Yes - good idea. I need it too. > Something along the lines of: > > ping doesn't work between IPoIB nodes. What should I do ? > > First, verify that the ports are active. This can be done via: > > cat /sys/class/infiniband/mthca0/ports/1/state cat: /sys/class/infiniband/mthca0/ports/1/state: No such file or directory gsyprf3:~# ls /sys/class/infiniband/ gsyprf3:~# lsmod Module Size Used by ib_ipoib 104344 0 ib_sa 24620 1 ib_ipoib ipt_state 5528 13 ib_mthca 168167 0 ib_mad 60352 2 ib_sa,ib_mthca ib_core 81328 4 ib_ipoib,ib_sa,ib_mthca,ib_mad gsyprf3:~# lspci -vs 81:0.0 0000:81:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66MHz, medium devsel, IRQ 67 Memory at 00000000cf700000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000cf800000 (64-bit, prefetchable) [size=8M] Memory at 00000000d0000000 (64-bit, prefetchable) [size=256M] Capabilities: [40] #11 [001f] Capabilities: [50] Vital Product Data Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device. Maybe trying to use an unsupported card? Ah yes...something along that line: ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0) GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 ib_mthca 0000:81:00.0: Unhandled event 0f(00) on eqn 3 ib_query_gid failed (-16) for mthca0 (index 12) ib_query_port failed (-16) for mthca0 ib_mthca 0000:81:00.0: WRITE_MTT failed (-16) ib_mad: Couldn't create ib_mad CQ ib_mad: Couldn't open mthca0 port 1 I can debug this a bit more later today. In any case, the FAQ sounds like a great idea.
grant From roland at topspin.com Tue Nov 23 13:20:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 13:20:38 -0800 Subject: [openib-general] Start of an IPoIB FAQ In-Reply-To: <20041123202838.GI10431@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 23 Nov 2004 12:28:38 -0800") References: <1101234021.19855.16.camel@localhost.localdomain> <20041123202838.GI10431@esmail.cup.hp.com> Message-ID: <52zn182rfd.fsf@topspin.com> > GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 > ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 > ib_mthca 0000:81:00.0: Unhandled event 0f(00) on eqn 3 > ib_query_gid failed (-16) for mthca0 (index 12) > ib_query_port failed (-16) for mthca0 > ib_mthca 0000:81:00.0: WRITE_MTT failed (-16) > ib_mad: Couldn't create ib_mad CQ > ib_mad: Couldn't open mthca0 port 1 Something very strange happened here. It looks like the event queue for firmware command completions overflowed, and then a couple of firmware commands timed out. What kind of system is this? What HCA firmware are you running? - R. From iod00d at hp.com Tue Nov 23 13:34:32 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 13:34:32 -0800 Subject: [openib-general] Start of an IPoIB FAQ In-Reply-To: <52zn182rfd.fsf@topspin.com> References: <1101234021.19855.16.camel@localhost.localdomain> <20041123202838.GI10431@esmail.cup.hp.com> <52zn182rfd.fsf@topspin.com> Message-ID: <20041123213432.GN10431@esmail.cup.hp.com> On Tue, Nov 23, 2004 at 01:20:38PM -0800, Roland Dreier wrote: > > GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 > > ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 > > ib_mthca 0000:81:00.0: Unhandled event 0f(00) on eqn 3 > > ib_query_gid failed (-16) for mthca0 (index 12) > > ib_query_port failed (-16) for mthca0 > > ib_mthca 0000:81:00.0: WRITE_MTT failed (-16) > > ib_mad: Couldn't create ib_mad CQ > > ib_mad: Couldn't open mthca0 port 1 > > Something very strange happened here. It looks like the event queue > for firmware command completions overflowed, and then a couple of > firmware commands timed out. > > What kind of system is this? ia64 rx2600. > What HCA firmware are you running? Erm...one of the firmware versions that I downloaded with tvflash. Non trivial to say since /sys/class/infiniband isn't available. Either fw3.3 or hca-cougar-a1-250-157.bin. Because tvflash can't identify it, I'll guess it's the fw3.3 (a version of HP's firmware that exposes the 3rd MMIO BAR). I can try again on another machine that has the hca-cougar firmware. grant From iod00d at hp.com Tue Nov 23 14:56:24 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 14:56:24 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... Message-ID: <20041123225624.GO10431@esmail.cup.hp.com> So the adventure continues on a different box (rx4640). (I'll go back to the rx2600 and reflash/reboot the box). With tvflash, I was able to upload the hca-cougar image I mentioned before successfully...at least that's what tvflash asserted. If you want me to try a different firmware, we should do that off-list. 
Running 2.6.10-rc2 kernel ended up with the following output: iowa:~# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:41:00.0) GSI 38 (level, low) -> CPU 1 (0x0100) vector 66 ACPI: PCI interrupt 0000:41:00.0[A] -> GSI 38 (level, low) -> IRQ 66 ib_mthca 0000:41:00.0: SYS_EN DDR error: syn=0, sock=0, sladdr=0, SPD source=DIMM ib_mthca 0000:41:00.0: SYS_EN returned status 0x07, aborting. ib_mthca: probe of 0000:41:00.0 failed with error -22 iowa:~# tvflash -i open_hca(0) flash_chip_reset() flash_check_failsafe() Error. String Tag not present (found tag 43 instead) HCA #0: Found MT23108, Cougar, revision A1 Primary image is valid, unknown source (sig 0x0/0x0) Secondary image is valid, unknown source (sig 0x0/0x0) Error. String Tag not present (found tag 43 instead) close_hca() Vital Product Dataiowa:~# tvflash isn't able to ID the new downloaded firmware. Seems like a bug but I don't have specs to see what it is. iowa:~# lspci -vs 41:0.0 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66MHz, medium devsel, IRQ 66 Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] Memory at (64-bit, prefetchable) Capabilities: [40] #11 [001f] Capabilities: [50] Vital Product Data Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device. Oh...I think I see the problem. System Firmware is having problems with this card. I need to update firmware on this box anyway and will report back. thanks, grant From roland at topspin.com Tue Nov 23 15:31:10 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 15:31:10 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123172256.GA30264@kroah.com> (Greg KH's message of "Tue, 23 Nov 2004 09:22:56 -0800") References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> <20041123172256.GA30264@kroah.com> Message-ID: <52mzx82ldt.fsf@topspin.com> I've just checked in a change that converts this file from using RCU to protecting its structures with an rwlock_t. This should avoid any patent licensing issues. These functions are extremely unlikely to have SMP scalability issues so this isn't too painful. Thanks, Roland From bunk at stusta.de Tue Nov 23 16:13:28 2004 From: bunk at stusta.de (Adrian Bunk) Date: Wed, 24 Nov 2004 01:13:28 +0100 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Message-ID: <20041124001328.GE2927@stusta.de> On Tue, Nov 23, 2004 at 08:14:19AM -0800, Roland Dreier wrote: > Add implementation of core InfiniBand support. This can be thought of > as a midlayer that provides an abstraction between low-level hardware > drivers and upper level protocols (such as IP-over-InfiniBand). > > Signed-off-by: Roland Dreier > > > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:16.399144313 -$ > @@ -0,0 +1,11 @@ > +menu "InfiniBand support" > + > +config INFINIBAND > + tristate "InfiniBand support" > + default n >... This "default n" has no effect. cu Adrian -- "Is there not promise of rain?" 
Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed From roland at topspin.com Tue Nov 23 16:24:24 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 16:24:24 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041124001328.GE2927@stusta.de> (Adrian Bunk's message of "Wed, 24 Nov 2004 01:13:28 +0100") References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> <20041124001328.GE2927@stusta.de> Message-ID: <52is7w2ix3.fsf@topspin.com> Adrian> This "default n" has no effect. Thanks, I've deleted it from our tree. - Roland From mst at mellanox.co.il Wed Nov 24 11:16:24 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Nov 2004 21:16:24 +0200 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041123225624.GO10431@esmail.cup.hp.com> References: <20041123225624.GO10431@esmail.cup.hp.com> Message-ID: <20041124191624.GA9404@mellanox.co.il> Hello! Quoting r. Grant Grundler (iod00d at hp.com) "[openib-general] HP ZX1 and HP IB cards...": > So the adventure continues on a different box (rx4640). > (I'll go back to the rx2600 and reflash/reboot the box). > > With tvflash, I was able to upload the hca-cougar image I mentioned > before successfully...at least that's what tvflash asserted. > > If you want me to try a different firmware, we should do that off-list. > > Running 2.6.10-rc2 kernel ended up with the following output: > iowa:~# modprobe ib_mthca > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:41:00.0) > GSI 38 (level, low) -> CPU 1 (0x0100) vector 66 > ACPI: PCI interrupt 0000:41:00.0[A] -> GSI 38 (level, low) -> IRQ 66 > ib_mthca 0000:41:00.0: SYS_EN DDR error: syn=0, sock=0, sladdr=0, SPD source=DIMM > ib_mthca 0000:41:00.0: SYS_EN returned status 0x07, aborting. > ib_mthca: probe of 0000:41:00.0 failed with error -22 > iowa:~# tvflash -i > open_hca(0) > flash_chip_reset() > flash_check_failsafe() > > Error. String Tag not present (found tag 43 instead) > HCA #0: Found MT23108, Cougar, revision A1 > Primary image is valid, unknown source (sig 0x0/0x0) > Secondary image is valid, unknown source (sig 0x0/0x0) > > > Error. String Tag not present (found tag 43 instead) > close_hca() > Vital Product Dataiowa:~# > > tvflash isn't able to ID the new downloaded firmware. > Seems like a bug but I don't have specs to see what it is. > > iowa:~# lspci -vs 41:0.0 > 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) > Subsystem: Mellanox Technology MT23108 InfiniHost > Flags: 66MHz, medium devsel, IRQ 66 > Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] > Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] > Memory at (64-bit, prefetchable) > Capabilities: [40] #11 [001f] > Capabilities: [50] Vital Product Data > Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- > Capabilities: [70] PCI-X non-bridge device. > > Oh...I think I see the problem. > System Firmware is having problems with this card. > I need to update firmware on this box anyway and will report back. > > thanks, > grant If you think it's a flash issue, try flashing with flint (mstflint under the openib tree works without kernel modules). This is what we always use at Mellanox for Cougars.
MST From roland at topspin.com Wed Nov 24 11:39:27 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 24 Nov 2004 11:39:27 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041123195345.GC8367@mars.ravnborg.org> (Sam Ravnborg's message of "Tue, 23 Nov 2004 20:53:45 +0100") References: <20041123814.p0AnYzTlx42JeVes@topspin.com> <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123195345.GC8367@mars.ravnborg.org> Message-ID: <52brdnyr2o.fsf@topspin.com> Sam> After giving it a second thought my vote goes for: Sam> include/linux/infiniband Could you share the reasoning that led to that preference? Unfortunately we don't seem to be converging on one choice of location. On one side there is the fact that the .h files are not used outside of drivers/infiniband -- hence they should stay under drivers/infiniband. On the other side is the fact that moving the includes under include/ gets rid of some CFLAGS lines in the Makefile. I don't see a conclusive reason to choose any particular place. Perhaps Linus or Andrew can simply hand down an authoritative answer? Thanks, Roland From iod00d at hp.com Wed Nov 24 14:35:52 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 24 Nov 2004 14:35:52 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041124191624.GA9404@mellanox.co.il> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> Message-ID: <20041124223552.GA15993@esmail.cup.hp.com> On Wed, Nov 24, 2004 at 09:16:24PM +0200, Michael S. Tsirkin wrote: > > 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) > > Subsystem: Mellanox Technology MT23108 InfiniHost > > Flags: 66MHz, medium devsel, IRQ 66 > > Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] > > Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] > > Memory at (64-bit, prefetchable) is the problem. > > System Firmware is having problems with this card. yeah - turned out to be a firmware bug...not clear the firmware team will fix it but they are at least aware of it. We are also considering adding 64-bit MMIO (aka GMMIO) support to ia64-linux. We just learned that some boxes don't assign GMMIO. > If you think it's a flash issue, try flashing with flint > (mstflint under the openib tree works without kernel modules). This is what we > always use at Mellanox for Cougars. I'm certain this is not a flash issue. There might be issues with flash but not this one. thanks though, grant From mshefty at ichips.intel.com Wed Nov 24 16:59:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 24 Nov 2004 16:59:54 -0800 Subject: [openib-general] [PATCH] cleanup/fixes for handle_outgoing_smp Message-ID: <41A52E8A.3000802@ichips.intel.com> This patch restructures handle_outgoing_smp to improve its readability and fixes the following issues: removes unneeded memory allocation for received SMP, properly sends an SMP if the underlying HCA driver does not provide a process_mad routine, and deallocates the allocated received SMP in all failure cases.
- Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1291) +++ core/mad.c (working copy) @@ -366,108 +366,92 @@ struct ib_send_wr *send_wr) { int ret; + struct ib_mad_private *mad_priv; + struct ib_mad_send_wc mad_send_wc; if (!smi_handle_dr_smp_send(smp, mad_agent->device->node_type, mad_agent->port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); - goto error1; + goto out; } - if (smi_check_local_dr_smp(smp, - mad_agent->device, - mad_agent->port_num)) { - struct ib_mad_private *mad_priv; - struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_send_wc mad_send_wc; - - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - ret = -ENOMEM; - printk(KERN_ERR PFX "No memory for local " - "response MAD\n"); - goto error1; - } + /* Check to post send on QP or process locally. */ + ret = smi_check_local_dr_smp(smp, mad_agent->device, + mad_agent->port_num); + if (!ret || !mad_agent->device->process_mad) + goto out; - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - - if (mad_agent->device->process_mad) { - ret = mad_agent->device->process_mad( - mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto out; + } + ret = mad_agent->device->process_mad(mad_agent->device, 0, + mad_agent->port_num, smp->dr_slid, (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (ret & IB_MAD_RESULT_CONSUMED) { - ret = 1; - goto error1; - } - if (ret & IB_MAD_RESULT_REPLY) { - /* - * See if response is solicited and - * there is a recv handler - */ - if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; - - /* - * Defined behavior is to - * complete response before - * request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = - sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = - &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler( - mad_agent, - &mad_priv->header.recv_wc); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } - - if (mad_agent_priv->agent.send_handler) { - /* Now, complete send */ - mad_send_wc.status = IB_WC_SUCCESS; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = send_wr->wr_id; - mad_agent_priv->agent.send_handler( - mad_agent, - &mad_send_wc); - ret = 1; + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent->recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent->recv_handler(mad_agent, + &mad_priv->header.recv_wc); } else - ret = -EINVAL; - } else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = -EINVAL; + goto out; + } -error1: + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent->send_handler(mad_agent, &mad_send_wc); + ret = 1; +out: return ret; } From eli at mellanox.co.il Thu Nov 25 04:36:30 2004 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 25 Nov 2004 14:36:30 +0200 Subject: [openib-general] [PATCH] cache.c fixes Message-ID: <506C3D7B14CDD411A52C00025558DED605B1A4F9@mtlex01.yok.mtl.com> Looks like allocation size is buggy and also cleanup. 
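To spell out the allocation bug: without parentheses,

	kmalloc(sizeof *device->cache.pkey_cache *
		end_port(device) - start_port(device), GFP_KERNEL);

parses as (sizeof(element) * end_port(device)) - start_port(device), since multiplication binds tighter than subtraction, and the inclusive port range also needs the + 1. Hence the parenthesized (end_port(device) - start_port(device) + 1) below.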
Index: cache.c =================================================================== --- cache.c (revision 1292) +++ cache.c (working copy) @@ -249,10 +249,10 @@ device->cache.pkey_cache = kmalloc(sizeof *device->cache.pkey_cache * - end_port(device) - start_port(device), GFP_KERNEL); + (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.gid_cache = kmalloc(sizeof *device->cache.pkey_cache * - end_port(device) - start_port(device), GFP_KERNEL); + (end_port(device) - start_port(device) + 1), GFP_KERNEL); if (!device->cache.pkey_cache || !device->cache.gid_cache) { printk(KERN_WARNING "Couldn't allocate cache " @@ -280,8 +280,14 @@ } err: - kfree(device->cache.pkey_cache); - kfree(device->cache.gid_cache); + if (device->cache.pkey_cache) { + kfree(device->cache.pkey_cache); + device->cache.pkey_cache = NULL; + } + if (device->cache.gid_cache) { + kfree(device->cache.gid_cache); + device->cache.gid_cache = NULL; + } } void ib_cache_cleanup_one(struct ib_device *device) @@ -296,8 +302,12 @@ kfree(device->cache.gid_cache[p]); } - kfree(device->cache.pkey_cache); - kfree(device->cache.gid_cache); + if (device->cache.pkey_cache) { + kfree(device->cache.pkey_cache); + } + if (device->cache.gid_cache) { + kfree(device->cache.gid_cache); + } } struct ib_client cache_client = { From shaharf at voltaire.com Thu Nov 25 05:03:45 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 25 Nov 2004 15:03:45 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID: > > Sam> After giving it a second thought my vote goes for: > Sam> include/linux/infiniband > > Could you share the reasoning that led to that preference? > > Unfortunately we don't seem to be converging on one choice of location. > > On one side there is the fact that the .h files are not used outside > of drivers/infiniband -- hence they should stay under drivers/infiniband. > > On the other side is the fact that moving the includes under include/ > gets rid of some CFLAGS lines in the Makefile. > > I don't see a conclusive reason to choose any particular place. > Perhaps Linus or Andrew can simply hand down an authoritative answer? > > Thanks, > Roland (This message is posted to openib-general only) I agree that headers that are not used outside drivers/infiniband should stay there, but it seems that some header currently located in drivers/infiniband may be used by user mode programs - ib_user_mad.h for example, but also parts of ib_mad.h, ib_sa.h, etc. So there are two issues - 1 Shouldn't we move known public headers to include/linux/infiniband? 2 I would prefer to let user mode stuff include IB related headers such as ib_mad.h without including real kernel-only stuff. Can we separate these files to (user mode) public parts and kernel only (even drivers/infiniband only) parts? If we do that, I would say that the public headers should be located in include/linux/infiniband and leave the private headers where they are today. Of course we can also use the #ifdef __KERNEL__ mechanism, but personally I don't like it.
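(By the #ifdef mechanism I mean the usual __KERNEL__ guard, roughly:

	/* definitions shared with userspace */
	struct ib_user_mad_reg_req { ... };

	#ifdef __KERNEL__
	/* kernel-only declarations */
	#endif

in one header included from both sides -- the above is a sketch only.)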
Shahar From roland at topspin.com Thu Nov 25 07:50:30 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 25 Nov 2004 07:50:30 -0800 Subject: [openib-general] [PATCH] cache.c fixes In-Reply-To: <506C3D7B14CDD411A52C00025558DED605B1A4F9@mtlex01.yok.mtl.com> (Eli Cohen's message of "Thu, 25 Nov 2004 14:36:30 +0200") References: <506C3D7B14CDD411A52C00025558DED605B1A4F9@mtlex01.yok.mtl.com> Message-ID: <52oehmx709.fsf@topspin.com> Thanks for pointing this out. However your patch was seriously whitespace damaged. Also kfree(NULL) is perfectly fine and even encouraged for better readability. So I just applied the kmalloc() part of the fix by hand. Thanks, Roland From roland at topspin.com Thu Nov 25 07:51:59 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 25 Nov 2004 07:51:59 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: (shaharf@voltaire.com's message of "Thu, 25 Nov 2004 15:03:45 +0200") References: Message-ID: <52fz2yx6xs.fsf@topspin.com> shaharf> I agree that headers that are not used outside shaharf> drivers/infiniband should stay there, but it seems that shaharf> some header currently located in drivers/infiniband may shaharf> be used by user mode programs - ib_user_mad.h for shaharf> example, but also parts of ib_mad.h, ib_sa.h, etc. I believe the current feeling in the kernel community is that kernel headers should be kernel only and if userspace needs a header file, there should be a separate userspace version of the file. - Roland From shaharf at voltaire.com Thu Nov 25 08:37:19 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 25 Nov 2004 18:37:19 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID: > > I believe the current feeling in the kernel community is that kernel > headers should be kernel only and if userspace needs a header file, > there should be a separate userspace version of the file. > > - Roland OK, I accept that for most IB types and structures, but what about ib_user_mad structure & header? I suggest exposing that header to the user space (like many other files, asm/errno for example), or alternatively to use only well known structures for the device (I am not sure if that is feasible). Shahar From volta104 at mail.netvision.net.il Thu Nov 25 09:29:04 2004 From: volta104 at mail.netvision.net.il (volta104 at mail.netvision.net.il) Date: Thu, 25 Nov 2004 12:29:04 -0500 Subject: [openib-general] MAD registration for newer vendor classes Message-ID: <194470-220041142517294790@M2W103.mail2web.com> Hi, For the newer vendor classes (0x30-0x4f), should we add OUI to the registration and put the demux into the MAD layer for these classes by OUI ? If so, I will work up a patch for this. -- Hal From volta104 at mail.netvision.net.il Thu Nov 25 09:32:23 2004 From: volta104 at mail.netvision.net.il (volta104 at mail.netvision.net.il) Date: Thu, 25 Nov 2004 12:32:23 -0500 Subject: [openib-general] OUI Needed for OpenIB Alliance ? Message-ID: <52540-2200411425173223129@M2W056.mail2web.com> Hi, One more thing I forgot in the last post: It also seems like we might need an OpenIB alliance OUI for any vendor class MADs, etc. that we might define. Note that one use of this would be a vendor specific ping as a diagnostic tool. There are others.
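(For reference, classes 0x30-0x4f carry the OUI in the MAD itself; the vendor class format per the IB spec is roughly:

	struct ib_vendor_mad {
		struct ib_mad_hdr  mad_hdr;
		struct ib_rmpp_hdr rmpp_hdr;
		u8                 reserved;
		u8                 oui[3];      /* vendor OUI */
		u8                 data[216];
	};

struct and field names here are a sketch, not anything in the tree yet.)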
-- Hal From tduffy at sun.com Thu Nov 25 09:45:32 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 25 Nov 2004 09:45:32 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: References: Message-ID: <1101404732.1290.4.camel@duffman> On Thu, 2004-11-25 at 18:37 +0200, shaharf wrote: > > > > I believe the current feeling in the kernel community is that kernel > > headers should be kernel only and if userspace needs a header file, > > there should be a separate userspace version of the file. > > > > - Roland > > OK, I accept that for most IB types and structures, but what about > ib_user_mad structure & header? > > I suggest exposing that header to the user space (like many other files, > asm/errno for example), or alternatively to use only well known > structures for the device (I am not sure if that is feasible). Right, but that should be part of some *other* package (not the kernel). A copy of the file, like how glibc ships kernel headers separately. -tduffy From mst at mellanox.co.il Thu Nov 25 10:13:11 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 25 Nov 2004 20:13:11 +0200 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041124223552.GA15993@esmail.cup.hp.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> <20041124223552.GA15993@esmail.cup.hp.com> Message-ID: <20041125181311.GA18098@mellanox.co.il> Hello! Quoting r. Grant Grundler (iod00d at hp.com) "Re: [openib-general] HP ZX1 and HP IB cards...": > On Wed, Nov 24, 2004 at 09:16:24PM +0200, Michael S. Tsirkin wrote: > > > 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) > > > Subsystem: Mellanox Technology MT23108 InfiniHost > > > Flags: 66MHz, medium devsel, IRQ 66 > > > Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] > > > Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] > > > Memory at (64-bit, prefetchable) > > is the problem. > > > > System Firmware is having problems with this card. > > yeah - turned out to be a firmware bug...not clear the firmware > team will fix it but they are at least aware of it. > > We are also considering adding 64-bit MMIO (aka GMMIO) support > to ia64-linux. We just learned that some boxes don't assign GMMIO. Makes perfect sense to me. I always wondered why more systems don't do it. It is a real problem to allocate a 256Mbyte DDR BAR out of the 32-bit PCI space. MST From roland at topspin.com Thu Nov 25 10:25:48 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 25 Nov 2004 10:25:48 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: (shaharf@voltaire.com's message of "Thu, 25 Nov 2004 18:37:19 +0200") References: Message-ID: <528y8pyedv.fsf@topspin.com> shaharf> OK, I accept that for most IB types and structures, but shaharf> what about ib_user_mad structure & header? shaharf> I suggest exposing that header to the user space (like shaharf> many other files, asm/errno for example), or shaharf> alternatively to use only well known structures for the shaharf> device (I am not sure if that is feasible).
/usr/include/asm/errno.h does not come directly from the kernel. It is sanitized and packaged as part of glibc, and even this use is largely due to historical reasons. Adding more such dependencies on kernel headers is not the right way forward. It would make sense for OpenIB to ship a package like "libibmad" that has all the headers required for using the ib_umad module. Userspace and the kernel need to agree on the ABI, obviously, but physically sharing the same .h file ends up creating more problems than it solves. - Roland From shaharf at voltaire.com Sun Nov 28 01:13:44 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 28 Nov 2004 11:13:44 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID: > > It would make sense for OpenIB to ship a package like "libibmad" that > has all the headers required for using the ib_umad module. Userspace > and the kernel need to agree on the ABI, obviously, but physically > sharing the same .h file ends up creating more problems than it solves. > > - Roland OK. I am currently working on such libibmad and I will add the header file copy to it. Please take the usermode stuff into account when changing relevant header files. Shahar From halr at voltaire.com Sun Nov 28 07:38:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 28 Nov 2004 10:38:50 -0500 Subject: [openib-general] Re: [PATCH] cleanup/fixes for handle_outgoing_smp In-Reply-To: <41A52E8A.3000802@ichips.intel.com> References: <41A52E8A.3000802@ichips.intel.com> Message-ID: <1101656330.4145.9.camel@localhost.localdomain> On Wed, 2004-11-24 at 19:59, Sean Hefty wrote: > This patch restructures handle_outgoing_smp to improve its readability > and fixes the following issues: removes unneeded memory allocation for > received SMP, properly sends an SMP if the underlying HCA driver does not > provide a process_mad routine, and deallocates the allocated received > SMP in all failure cases. This patch was rejected. I'm not sure why. Can you regenerate it? -- Hal From roland at topspin.com Sun Nov 28 08:43:26 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 28 Nov 2004 08:43:26 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: (shaharf@voltaire.com's message of "Sun, 28 Nov 2004 11:13:44 +0200") References: Message-ID: <52d5xxx6tt.fsf@topspin.com> shaharf> OK. I am currently working on such libibmad and I will shaharf> add the header file copy to it. shaharf> Please take the usermode stuff into account when changing shaharf> relevant header files. The intention is that userspace can read the kernel's ABI version from /sys/class/infiniband_mad/abi_version and compare it to the value of IB_USER_MAD_ABI_VERSION that userspace was compiled with. We just need to be careful to increment the ABI version if we make any incompatible changes to ib_user_mad.h.
From shaharf at voltaire.com Mon Nov 29 02:33:03 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 29 Nov 2004 12:33:03 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID:
> > The intention is that userspace can read the kernel's ABI version from > /sys/class/infiniband_mad/abi_version and compare it to the value of > IB_USER_MAD_ABI_VERSION that userspace was compiled with. We just > need to be careful to increment the ABI version if we make any > incompatible changes to ib_user_mad.h. > > - Roland
I will implement such a check in the library. Thanks, Shahar
From Andras.Horvath at cern.ch Mon Nov 29 02:39:34 2004 From: Andras.Horvath at cern.ch (Andras.Horvath at cern.ch) Date: Mon, 29 Nov 2004 11:39:34 +0100 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041124191624.GA9404@mellanox.co.il> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> Message-ID: <20041129103934.GT2630@cern.ch>
Hello, Maybe related, maybe not: I also have an HP rx2600 and a Voltaire HCA, same kernel (2.6.10-rc2), but a different error (after modprobe mthca and a few minutes of delay). Please see below. Andras
ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0)
GSI 60 (level, low) -> CPU 1 (0x0100) vector 61
ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 61
ib_mthca 0000:81:00.0: Found bridge: Mellanox Technology MT23108 PCI Bridge (0000:80:01.0)
ib_mthca 0000:81:00.0: FW version 000100180000, max commands 1
ib_mthca 0000:81:00.0: FW size 6143 KB (start cfa00000, end cfffffff)
ib_mthca 0000:81:00.0: HCA memory size 131071 KB (start c8000000, end cfffffff)
ib_mthca 0000:81:00.0: Max QPs: 16777216, reserved QPs: 16, entry size: 256
ib_mthca 0000:81:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64
ib_mthca 0000:81:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64
ib_mthca 0000:81:00.0: reserved MPTs: 16, reserved MTTs: 16
ib_mthca 0000:81:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1
ib_mthca 0000:81:00.0: Max QP/MCG: 16777216, reserved MGMs: 0
ib_mthca 0000:81:00.0: Flags: 003f0337
ib_mthca 0000:81:00.0: profile[ 0]--10/20 @ 0x c8000000 (size 0x 4000000)
ib_mthca 0000:81:00.0: profile[ 1]-- 0/16 @ 0x cc000000 (size 0x 1000000)
ib_mthca 0000:81:00.0: profile[ 2]-- 7/18 @ 0x cd000000 (size 0x 800000)
ib_mthca 0000:81:00.0: profile[ 3]-- 9/17 @ 0x cd800000 (size 0x 800000)
ib_mthca 0000:81:00.0: profile[ 4]-- 3/16 @ 0x ce000000 (size 0x 400000)
ib_mthca 0000:81:00.0: profile[ 5]-- 4/16 @ 0x ce400000 (size 0x 200000)
ib_mthca 0000:81:00.0: profile[ 6]--12/15 @ 0x ce600000 (size 0x 100000)
ib_mthca 0000:81:00.0: profile[ 7]-- 8/13 @ 0x ce700000 (size 0x 80000)
ib_mthca 0000:81:00.0: profile[ 8]--11/ 7 @ 0x ce780000 (size 0x 1000)
ib_mthca 0000:81:00.0: profile[ 9]-- 6/ 5 @ 0x ce781000 (size 0x 800)
ib_mthca 0000:81:00.0: HCA memory: allocated 105990 KB/124928 KB (18938 KB free)
ib_mthca 0000:81:00.0: Allocated EQ 1 with 65536 entries
ib_mthca 0000:81:00.0: Allocated EQ 2 with 128 entries
ib_mthca 0000:81:00.0: Allocated EQ 3 with 128 entries
ib_mthca 0000:81:00.0: Setting mask 00000000000343fe for eqn 2
ib_mthca 0000:81:00.0: Setting mask 0000000000000400 for eqn 3
ib_mthca 0000:81:00.0: Failed to initialize queue pair table, aborting.
ib_mthca 0000:81:00.0: Clearing mask 00000000000343fe for eqn 2
ib_mthca 0000:81:00.0: Clearing mask 0000000000000400 for eqn 3
ib_mthca: probe of 0000:81:00.0 failed with error -16
80:01.0 PCI bridge: Mellanox Technology MT23108 PCI Bridge (rev a1) (prog-if 00 [Normal decode]) Flags: bus master, 66Mhz, medium devsel, latency 64 Bus: primary=80, secondary=81, subordinate=81, sec-latency=64 Memory behind bridge: c8000000-d08fffff Capabilities: [70] PCI-X non-bridge device.
81:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66Mhz, medium devsel, IRQ 61 Memory at 00000000d0800000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000d0000000 (64-bit, prefetchable) [size=8M] Memory at 00000000c8000000 (64-bit, prefetchable) [size=128M] Capabilities: [40] #0d [001f] Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device.
From mst at mellanox.co.il Mon Nov 29 05:43:44 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 15:43:44 +0200 Subject: [openib-general] struct class and device.c Message-ID: <20041129134344.GA2991@mellanox.co.il>
Hi! Why doesn't core/device.c use struct class to manage the list of IB devices? Are there disadvantages with this approach? mst
From roland at topspin.com Mon Nov 29 08:36:19 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 08:36:19 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041129103934.GT2630@cern.ch> (Andras Horvath's message of "Mon, 29 Nov 2004 11:39:34 +0100") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> <20041129103934.GT2630@cern.ch> Message-ID: <52zn10vcho.fsf@topspin.com>
Andras> Hello, Maybe related, maybe not: I also have an HP rx2600 Andras> and a Voltaire HCA, same kernel (2.6.10-rc2), but a Andras> different error (after modprobe mthca and a few minutes of Andras> delay). Please see below.
This looks like an interrupt routing problem.
Andras> ib_mthca 0000:81:00.0: Failed to initialize queue pair table, aborting. Andras> ib_mthca 0000:81:00.0: Clearing mask 00000000000343fe for eqn 2 Andras> ib_mthca 0000:81:00.0: Clearing mask 0000000000000400 for eqn 3 Andras> ib_mthca: probe of 0000:81:00.0 failed with error -16
Initializing the QP table is the first time the driver tries to execute a FW command and get a completion interrupt. It seems the driver never sees the interrupt and eventually times out (should be a 1 minute timeout). Do other drivers work on this system? With this kernel? If so, what IRQ do they assign to the HCA (it's shown in /proc/interrupts after the driver is loaded). Thanks, Roland
From roland at topspin.com Mon Nov 29 08:37:49 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 08:37:49 -0800 Subject: [openib-general] struct class and device.c In-Reply-To: <20041129134344.GA2991@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Nov 2004 15:43:44 +0200") References: <20041129134344.GA2991@mellanox.co.il> Message-ID: <52vfbovcf6.fsf@topspin.com>
Michael> Hi! Why doesn't core/device.c use struct class to Michael> manage the list of IB devices? Are there disadvantages Michael> with this approach?
struct class is used. The code is in core/sysfs.c - Roland
From ido at mellanox.co.il Mon Nov 29 09:03:12 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Mon, 29 Nov 2004 19:03:12 +0200 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. Message-ID: <91DB792C7985D411BEC300B40080D29C711B95@mtvex01.mtv.mtl.com>
As far as I understand, if the kernel calls "ipoib_neigh_destructor" after the ib_ipoib module has been taken down, a kernel oops can occur. In most cases when a driver is taken down, the kernel cleanup has already destroyed all the relevant ipoib driver entries.
We noticed that while applications such as NetPerf are running while ipoib is taken down, the neighbor entry may be held (by the kernel) after the module is taken down. The destructor will only be called way after the application exits or is terminated. In such a case the kernel will call the destructor method after the module is already down, resulting in a kernel oops.
I am not 100% sure about that, but this is what I am seeing happening when I take the ipoib module down while netperf is running. Am I right? And if so, what can be done? I thought that maybe we could change the neighbor destructor pointer to NULL when the module exits. -Ido
Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game
From ido at mellanox.co.il Mon Nov 29 09:10:45 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Mon, 29 Nov 2004 19:10:45 +0200 Subject: [openib-general] Unicast ARP Message-ID: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com>
In gen2 we have a problem if the destination path (e.g. LID) changes. Typically the kernel periodically issues unicast ARP packets to addresses which are in the ARP cache to ensure that the neighbors are up and that the ARP cache is up to date. In gen2, unicast ARPs arrive at the ipoib driver (hard_xmit) with no address handle assigned. Now, an SA query is initiated by the ipoib driver, then ARP is sent to this address, and when ARP response arrives back, ARP cache is not updated with the new path. In other words, the ARP cache believes that the relevant entry is up to date and doesn't notice the path change.
I think that we should hold a linked list which contains the current address handles with the corresponding GID. When a unicast ARP is sent (hard_xmit), instead of going to the SA, look up the right GID in our list, then send a unicast ARP with this address handle. If the path has changed, then the ARP reply won't arrive and it will cause the ARP cache to be refreshed.
This solution also reduces the burden on the SA. What do you think ? -Ido
Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game
From halr at voltaire.com Mon Nov 29 09:02:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 12:02:49 -0500 Subject: [openib-general] [RFC] [PATCH] mad: Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU In-Reply-To: <419E64C7.9030905@ichips.intel.com> References: <1100895525.4136.11.camel@localhost.localdomain> <419E64C7.9030905@ichips.intel.com> Message-ID: <1101747769.4145.179.camel@localhost.localdomain>
On Fri, 2004-11-19 at 16:25, Sean Hefty wrote: > Hal Rosenstock wrote: > > Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU > > (Note that I have not applied this but am requesting comments). > > > > Index: mad.c > > =================================================================== > > --- mad.c (revision 1269) > > +++ mad.c (working copy) > > @@ -1900,7 +1900,7 @@ > > goto error7; > > > > snprintf(name, sizeof name, "ib_mad%d", port_num); > > - port_priv->wq = create_workqueue(name); > > + port_priv->wq = create_singlethread_workqueue(name); > > if (!port_priv->wq) { > > ret = -ENOMEM; > > goto error8; > > My guess is that this is probably preferable to having 1/port/CPU, > especially on larger systems. It would depend on what the clients do > when notified of a completion.
I too think we should change this to a single threaded workqueue (and will do so shortly). > I guess one advantage of keeping it 1/port/CPU (for now) is that it > would help test multi-threaded support. One can always change it back for testing purposes but this means there is not as much "automatic" testing by default. -- Hal
From roland at topspin.com Mon Nov 29 09:33:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 09:33:05 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <91DB792C7985D411BEC300B40080D29C711B95@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Mon, 29 Nov 2004 19:03:12 +0200") References: <91DB792C7985D411BEC300B40080D29C711B95@mtvex01.mtv.mtl.com> Message-ID: <521xecv9v2.fsf@topspin.com>
Ido> As far as I understand, if the kernel calls Ido> "ipoib_neigh_destructor" after the ib_ipoib module has been Ido> taken down, a kernel oops can occur. In most cases when a Ido> driver is taken down, the kernel cleanup has already Ido> destroyed all the relevant ipoib driver entries. We noticed Ido> that while applications such as NetPerf are running while Ido> ipoib is taken down, the neighbor entry may be held (by the Ido> kernel) after the module is taken down. The destructor will Ido> only be called way after the application exits or is Ido> terminated.
This seems quite likely. I didn't check this very carefully -- it looked like the kernel killed all neighbours when an interface is unregistered, but I guess when the neighbours are still in use they can hang around for a while.
It does seem like IPoIB needs to keep track of all the neighbour structures it has added path context to. When unregistering an interface, IPoIB should free the path context and reset the destructor. However I'm not sure exactly how to do this while making sure the kernel hasn't freed the neighbour first (it seems there are some tricky races here). - Roland
From roland at topspin.com Mon Nov 29 09:34:42 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 09:34:42 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Mon, 29 Nov 2004 19:10:45 +0200") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> Message-ID: <52wtw4tv7x.fsf@topspin.com>
Ido> I think that we should hold a linked list which contains the Ido> current address handles with the corresponding GID. When a Ido> unicast ARP is sent (hard_xmit), instead of going to the SA, Ido> look up the right GID in our list, then send a unicast ARP Ido> with this address handle. If the path has changed, then the ARP Ido> reply won't arrive and it will cause the ARP cache to be Ido> refreshed.
This makes sense, although I would use an rb_tree indexed by GID rather than a linked list. Another possibility would be to perform the SA lookup for unicast ARPs and then update any neighbour path information if the reply is different from what we have stored. - Roland
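(For illustration, Roland's rb_tree suggestion might look roughly like the following sketch against the 2.6 rbtree API. All structure and function names here are invented; locking, reference counting, duplicate-key handling, and eviction are omitted.)

#include <linux/rbtree.h>
#include <linux/string.h>
#include <ib_verbs.h>	/* union ib_gid, struct ib_ah (path assumed) */

struct ipoib_ah_cache_entry {
	struct rb_node	node;
	union ib_gid	gid;	/* lookup key */
	struct ib_ah   *ah;	/* cached address handle for this GID */
};

static struct ipoib_ah_cache_entry *ah_cache_find(struct rb_root *root,
						  union ib_gid *gid)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct ipoib_ah_cache_entry *e =
			rb_entry(n, struct ipoib_ah_cache_entry, node);
		int cmp = memcmp(gid, &e->gid, sizeof *gid);

		if (cmp < 0)
			n = n->rb_left;
		else if (cmp > 0)
			n = n->rb_right;
		else
			return e;
	}
	return NULL;
}

static void ah_cache_insert(struct rb_root *root,
			    struct ipoib_ah_cache_entry *e)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct ipoib_ah_cache_entry *cur;

		parent = *p;
		cur = rb_entry(parent, struct ipoib_ah_cache_entry, node);
		if (memcmp(&e->gid, &cur->gid, sizeof e->gid) < 0)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&e->node, parent, p);
	rb_insert_color(&e->node, root);
}

(On a path change the cached ah for that GID would simply be replaced, which is one way to address the stale-path concern raised above.)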
From ftillier at infiniconsys.com Mon Nov 29 09:37:11 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 29 Nov 2004 09:37:11 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <521xecv9v2.fsf@topspin.com> Message-ID: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com>
> From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 9:33 AM > > It does seem like IPoIB needs to keep track of all the neighbour > structures it has added path context to. When unregistering an > interface, IPoIB should free the path context and reset the > destructor. However I'm not sure exactly how to do this while making > sure the kernel hasn't freed the neighbour first (it seems there are > some tricky races here). >
Maybe I'm clueless (quite likely), but why not just have each neighbour structure take a reference on the module when it is created? The destructor would release that reference. That should solve races involved with cleaning up, as well as ensuring that the module is still around for the destructor to get invoked. - Fab
From halr at voltaire.com Mon Nov 29 09:33:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 12:33:39 -0500 Subject: [openib-general] Unicast ARP In-Reply-To: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> Message-ID: <1101749619.4145.214.camel@localhost.localdomain>
On Mon, 2004-11-29 at 12:10, Ido Bukspan wrote: > In gen2 we have a problem if the destination path (e.g. LID) > changes.
Is this restricted to LID changes or apply more generally to hardware address (GID and/or QPN) changes ? (I had seen something similar when the QPN changed; ARP timeout/retries are needed prior to connectivity being restored).
> Typically the kernel periodically issues unicast ARP packets to > addresses which are in the ARP cache to ensure that the neighbors are up and > that the ARP cache is up to date. In gen2, unicast ARPs arrive at the ipoib > driver (hard_xmit) with no address handle assigned. Now, an SA query is > initiated by the ipoib driver,
Is this the PathRecord lookup for unicast GID of destination ? Is the path record (info) cached ?
> then ARP is sent to this address, and when > ARP response arrives back, ARP cache is not updated with the new path.
Do you mean hardware address here (GID + QPN) ?
> In > other words, the ARP cache believes that the relevant entry is up to date and > doesn't notice the path change. > > I think that we should hold a linked list which contains the current > address handles with the corresponding GID. When a unicast ARP is sent > (hard_xmit), instead of going to the SA, look up the right GID in our list, > then send a unicast ARP with this address handle. If the path has changed, then > the ARP reply won't arrive and it will cause the ARP cache to be refreshed. > > This solution also reduces the burden on the SA. > > What do you think ? > > -Ido > > > > Ido Bukspan > Mellanox Technologies Ltd. > Phone : (972)-3-6259500 ,Ext 518. > Fax : (972)-3-5614943 > mailto:ido at mellanox.co.il > http://www.mellanox.com > > No play No game > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From roland at topspin.com Mon Nov 29 09:43:12 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 09:43:12 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down.
In-Reply-To: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 09:37:11 -0800") References: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> Message-ID: <52ekictutr.fsf@topspin.com> Fab> Maybe I'm clueless (quite likely), but why not just have each Fab> neighbour structure take a reference on the module when it is Fab> created? The destructor would release that reference. That Fab> should solve races involved with cleaning up, as well as Fab> ensuring that the module is still around for the destructor Fab> to get invoked. That's a possibility but then no IPoIB neighbour structures could ever be garbage collected, which doesn't seem like a good idea. - Roland From halr at voltaire.com Mon Nov 29 09:39:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 12:39:05 -0500 Subject: [openib-general] Unicast ARP In-Reply-To: <52wtw4tv7x.fsf@topspin.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <52wtw4tv7x.fsf@topspin.com> Message-ID: <1101749945.4145.219.camel@localhost.localdomain> On Mon, 2004-11-29 at 12:34, Roland Dreier wrote: > This makes sense, although I would use an rb_tree indexed by GID > rather than a linked list. Is GID sufficient or is QPN also needed ? -- Hal From roland at topspin.com Mon Nov 29 10:00:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:00:29 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <1101749945.4145.219.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 29 Nov 2004 12:39:05 -0500") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <52wtw4tv7x.fsf@topspin.com> <1101749945.4145.219.camel@localhost.localdomain> Message-ID: <52act0tu0y.fsf@topspin.com> Hal> Is GID sufficient or is QPN also needed ? You're right. It should be indexed by HW address. - R. From roland at topspin.com Mon Nov 29 10:03:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:03:04 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <1101749619.4145.214.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 29 Nov 2004 12:33:39 -0500") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> Message-ID: <52653ottwn.fsf@topspin.com> Hal> Is this restricted to LID changes or apply more generally to Hal> hardware address (GID and/or QPN) changes ? (I had seen Hal> something similar when the QPN changed; ARP timeout/retries Hal> are needed prior to connectivity being restored). Just LID changes. If GID or QPN changes, then the HW address is different and the kernel neighbour code can notice. It's just like an IP address moving to a different MAC on ethernet: either the old ARP entry has to time out, or the interface with the new address has to send a gratuitous ARP. Hal> Is this the PathRecord lookup for unicast GID of destination Hal> ? Is the path record (info) cached ? Yes, path lookup for unicast GID. It's not cached, which is what causes the issue: the ARP will get a reply and so the kernel will believe the neighbour is still valid, even though it has an obsolete path in it. Ido> then ARP is sent to this address, and when ARP response Ido> arrives back, ARP cache is not updated with the new path. Hal> Do you mean hardware address here (GID + QPN) ? No, he meant path (LID, SL, etc) - R. 
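(To make the LID-versus-hardware-address distinction above concrete: the 20-octet IPoIB link-layer address from the IPoIB draft carries the QPN and GID but not the LID, which lives in the path record. The layout sketch below uses our own struct and field names, not the driver's.)

#include <linux/types.h>

/* 20-octet IPoIB hardware address as carried in ARP payloads and
 * neighbour entries: the QPN and port GID are part of the address,
 * the LID is not, so a LID-only change leaves the kernel's neighbour
 * cache looking valid. */
struct ipoib_hw_addr {
	u8 reserved;	/* reserved in the draft */
	u8 qpn[3];	/* destination QPN, network byte order */
	u8 gid[16];	/* destination port GID */
} __attribute__ ((packed));

static inline u32 ipoib_hw_addr_qpn(const struct ipoib_hw_addr *addr)
{
	return (addr->qpn[0] << 16) | (addr->qpn[1] << 8) | addr->qpn[2];
}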
From roland at topspin.com Mon Nov 29 10:03:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:03:57 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 09:37:11 -0800") References: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> Message-ID: <521xecttv6.fsf@topspin.com>
Fab> Maybe I'm clueless (quite likely), but why not just have each Fab> neighbour structure take a reference on the module when it is Fab> created? The destructor would release that reference. That Fab> should solve races involved with cleaning up, as well as Fab> ensuring that the module is still around for the destructor Fab> to get invoked.
Sorry, I think I read this backwards before. But if each neighbour has a reference on the IPoIB module then it will be nearly impossible to unload the IPoIB module (which solves the problem but again not in a very nice way). - Roland
From ftillier at infiniconsys.com Mon Nov 29 10:07:06 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 29 Nov 2004 10:07:06 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <52ekictutr.fsf@topspin.com> Message-ID: <000101c4d63e$3d14db70$655aa8c0@infiniconsys.com>
> From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 9:43 AM > > Fab> Maybe I'm clueless (quite likely), but why not just have each > Fab> neighbour structure take a reference on the module when it is > Fab> created? The destructor would release that reference. That > Fab> should solve races involved with cleaning up, as well as > Fab> ensuring that the module is still around for the destructor > Fab> to get invoked. > > That's a possibility but then no IPoIB neighbour structures could ever > be garbage collected, which doesn't seem like a good idea.
I'm confused - why not? Why couldn't garbage collection invoke a similar code path to the destructor? - Fab
From roland at topspin.com Mon Nov 29 10:12:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:12:54 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <000101c4d63e$3d14db70$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 10:07:06 -0800") References: <000101c4d63e$3d14db70$655aa8c0@infiniconsys.com> Message-ID: <52wtw4sevt.fsf@topspin.com>
Fab> I'm confused - why not? Why couldn't garbage collection Fab> invoke a similar code path to the destructor?
I was confused -- I thought you meant for the module to take a reference on each neighbour (which would prevent it from being destroyed until the module released it). On the other hand, as I said, if the module can't be unloaded until there are no neighbours left, this could make it very difficult to unload the module. - Roland
From ftillier at infiniconsys.com Mon Nov 29 10:13:39 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 29 Nov 2004 10:13:39 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down.
In-Reply-To: <521xecttv6.fsf@topspin.com> Message-ID: <000201c4d63f$27aa4350$655aa8c0@infiniconsys.com>
> From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 10:04 AM > > Fab> Maybe I'm clueless (quite likely), but why not just have each > Fab> neighbour structure take a reference on the module when it is > Fab> created? The destructor would release that reference. That > Fab> should solve races involved with cleaning up, as well as > Fab> ensuring that the module is still around for the destructor > Fab> to get invoked. > > Sorry, I think I read this backwards before. But if each neighbour > has a reference on the IPoIB module then it will be nearly impossible > to unload the IPoIB module (which solves the problem but again not in > a very nice way). >
Don't the neighbour structures get cleaned up when an interface goes down? Can't the interface go down while the module has outstanding references? I probably just don't know enough about how module unload works. - Fab
From roland at topspin.com Mon Nov 29 10:16:40 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:16:40 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <000201c4d63f$27aa4350$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 10:13:39 -0800") References: <000201c4d63f$27aa4350$655aa8c0@infiniconsys.com> Message-ID: <52sm6ssepj.fsf@topspin.com>
Fab> Don't the neighbour structures get cleaned up when an Fab> interface goes down? Can't the interface go down while the Fab> module has outstanding references?
The original issue was that neighbour structures can hang around after an interface is unregistered. An interface can be downed even if the module has a non-zero reference count, of course. But under 2.6, for a normal network interface, one can rmmod the corresponding module at any time -- the interface is brought down and the module is unloaded. I'd prefer to preserve that, rather than forcing the user to down all IPoIB interfaces and then wait an arbitrarily long time for all neighbours to be freed before unloading the IPoIB module. - Roland
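(In code terms, Fab's suggestion is the standard module refcounting pattern sketched below; the function names are illustrative. Roland's objection is that rmmod ib_ipoib would then fail until the last such neighbour is garbage collected.)

#include <linux/module.h>
#include <linux/errno.h>
#include <net/neighbour.h>

/* Pin the module for each neighbour that carries IPoIB path context. */
static int ipoib_neigh_add_path(struct neighbour *neigh)
{
	if (!try_module_get(THIS_MODULE))
		return -ENODEV;	/* module is already on its way out */
	/* ... allocate path context and hang it off the neighbour ... */
	return 0;
}

/* Called by the core neighbour code, possibly long after ifdown. */
static void ipoib_neigh_destructor(struct neighbour *neigh)
{
	/* ... free the attached path context ... */
	module_put(THIS_MODULE);
}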
From mshefty at ichips.intel.com Mon Nov 29 10:51:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 10:51:31 -0800 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <194470-220041142517294790@M2W103.mail2web.com> References: <194470-220041142517294790@M2W103.mail2web.com> Message-ID: <41AB6FB3.4040301@ichips.intel.com>
volta104 at mail.netvision.net.il wrote: > Hi, > > For the newer vendor classes (0x30-0x4f), should we add OUI to the > registration and put the demux into the MAD layer for these classes by OUI ? > > If so, I will work up a patch for this.
I guess I need to re-examine the MAD dispatching, but I can't think of a reason why it wouldn't already support vendor classes. - Sean
From mst at mellanox.co.il Mon Nov 29 10:57:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 20:57:39 +0200 Subject: [openib-general] Unicast ARP In-Reply-To: <52653ottwn.fsf@topspin.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> Message-ID: <20041129185739.GA3394@mellanox.co.il>
Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Unicast ARP": > Hal> Is this restricted to LID changes or apply more generally to > Hal> hardware address (GID and/or QPN) changes ? (I had seen > Hal> something similar when the QPN changed; ARP timeout/retries > Hal> are needed prior to connectivity being restored). > > Just LID changes. If GID or QPN changes, then the HW address is > different and the kernel neighbour code can notice. It's just like an > IP address moving to a different MAC on ethernet: either the old ARP > entry has to time out, or the interface with the new address has to > send a gratuitous ARP.
Currently it also seems that by just bringing the interface up and down the hw address will change. Unless I am doing something wrong, this inconvenience seems to be caused by a different QP number being assigned. Can't this be solved e.g. by assigning a fixed QP number for IP over IB? Thanks, MST
From roland at topspin.com Mon Nov 29 11:04:14 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 11:04:14 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <20041129185739.GA3394@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Nov 2004 20:57:39 +0200") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> <20041129185739.GA3394@mellanox.co.il> Message-ID: <52oehgsci9.fsf@topspin.com>
Michael> Currently it also seems that by just bringing the Michael> interface up and down the hw address will change. Unless Michael> I am doing something wrong, this inconvenience seems to Michael> be caused by a different QP number being assigned. Can't Michael> this be solved e.g. by assigning a fixed QP number for IP Michael> over IB?
It seems you are doing something wrong. The QP is allocated when the interface is created and remains the same when the interface is brought up and down:
# ip addr show dev ib0 7: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff # ifconfig ib0 down # ifconfig ib0 up # ip addr show dev ib0 7: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff - Roland
From halr at voltaire.com Mon Nov 29 10:59:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 13:59:35 -0500 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <41AB6FB3.4040301@ichips.intel.com> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> Message-ID: <1101754775.4145.239.camel@localhost.localdomain>
On Mon, 2004-11-29 at 13:51, Sean Hefty wrote: > volta104 at mail.netvision.net.il wrote: > > > Hi, > > > > For the newer vendor classes (0x30-0x4f), should we add OUI to the > > registration and put the demux into the MAD layer for these classes by OUI ? > > > > If so, I will work up a patch for this. > > I guess I need to re-examine the MAD dispatching, but I can't think of > a reason why it wouldn't already support vendor classes.
There is a new range of vendor classes (at IBA 1.1) which embed the OUI so that multiple vendors can share the same class. There needs to be another level of demux for these. -- Hal
From mst at mellanox.co.il Mon Nov 29 11:17:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 21:17:51 +0200 Subject: [openib-general] struct class and device.c In-Reply-To: <52vfbovcf6.fsf@topspin.com> References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> Message-ID: <20041129191751.GA3450@mellanox.co.il>
Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] struct class and device.c": > Michael> Hi! Why doesn't core/device.c use struct class to > Michael> manage the list of IB devices? Are there disadvantages > Michael> with this approach? > > struct class is used. The code is in core/sysfs.c >
No, I had in mind using the children list in the class instead of the device_list in device.c and interfaces list instead of the client_list. Why not? Wouldn't that work? MST
From mshefty at ichips.intel.com Mon Nov 29 11:19:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:19:30 -0800 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <1101754775.4145.239.camel@localhost.localdomain> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> Message-ID: <41AB7642.2050803@ichips.intel.com>
Hal Rosenstock wrote: >>>For the newer vendor classes (0x30-0x4f), should we add OUI to the >>>registration and put the demux into the MAD layer for these classes by OUI ? >>> >>>If so, I will work up a patch for this. >> >>I guess I need to re-examine the MAD dispatching, but I can't think of >>a reason why it wouldn't already support vendor classes. > > > There is a new range of vendor classes (at IBA 1.1) which embed the OUI > so that multiple vendors can share the same class. There needs to be > another level of demux for these.
You know, I'd really like to rant about the entire MAD architecture right about now... I think it makes sense to add OUI to the MAD interface.
From halr at voltaire.com Mon Nov 29 11:16:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 14:16:22 -0500 Subject: [openib-general] Unicast ARP In-Reply-To: <52oehgsci9.fsf@topspin.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> <20041129185739.GA3394@mellanox.co.il> <52oehgsci9.fsf@topspin.com> Message-ID: <1101755782.4145.256.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:04, Roland Dreier wrote: > Michael> Currently it also seems that by just bringing the > Michael> interface up and down the hw address will change. Unless > Michael> I am doing something wrong, this inconvenience seems to > Michael> be caused by a different QP number being assigned. Can't > Michael> this be solved e.g. by assigning a fixed QP number for IP > Michael> over IB? > > It seems you are doing something wrong. The QP is allocated when the > interface is created and remains the same when the interface is > brought up and down:
The only time I see that is when removing and re-adding the IPoIB module. (That does not mean I am recommending a fixed QP for IPoIB (at least unconnected mode).) -- Hal
From roland at topspin.com Mon Nov 29 11:24:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 11:24:34 -0800 Subject: [openib-general] struct class and device.c In-Reply-To: <20041129191751.GA3450@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Nov 2004 21:17:51 +0200") References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> <20041129191751.GA3450@mellanox.co.il> Message-ID: <52k6s4sbkd.fsf@topspin.com>
Michael> No, I had in mind using the children list in the class Michael> instead of the device_list in device.c and interfaces Michael> list instead of the client_list. Why not? Wouldn't that Michael> work?
I guess it would work but it seems ugly to rely on the internals of struct class. Although drivers/ieee1394 does do exactly this (without any locking).... - R.
From halr at voltaire.com Mon Nov 29 11:23:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 14:23:26 -0500 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <41AB7642.2050803@ichips.intel.com> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> <41AB7642.2050803@ichips.intel.com> Message-ID: <1101756206.4145.265.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:19, Sean Hefty wrote: > > There is a new range of vendor classes (at IBA 1.1) which embed the OUI > > so that multiple vendors can share the same class. There needs to be > > another level of demux for these. > > You know, I'd really like to rant about the entire MAD architecture > right about now...
I was there for this one so you can vent at me :-)
> I think it makes sense to add OUI to the MAD interface.
I will work something up on this and post to the list when it is "ready". Also, based on this, do you think it makes sense for an OpenIB OUI (if we are to utilize these classes) ? -- Hal
From mshefty at ichips.intel.com Mon Nov 29 11:30:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:30:20 -0800 Subject: [openib-general] Re: [PATCH] cleanup/fixes for handle_outgoing_smp In-Reply-To: <1101656644.4145.15.camel@localhost.localdomain> References: <41A52E8A.3000802@ichips.intel.com> <1101656644.4145.15.camel@localhost.localdomain> Message-ID: <41AB78CC.2000006@ichips.intel.com>
Hal Rosenstock wrote: >>This patch restructures handle_outgoing_smp to improve its readability > > I can't see for sure from your patch.
The main changes are that the code is outdented and moved from nested if's to a switch statement.
>>and fixes the following issues: removes unneeded memory allocation for >>received SMP, > > It looks like the allocation strategy is slightly modified.
It was. The allocation is not done unless process_mad will be called.
>>properly sends an SMP if the underlying HCA driver does not >>provide a process_mad routine, > > Missed setting the return code here.
I believe that the original code would call the agent's send_handler if process_mad was not provided.
> >>and deallocates the allocated received >>SMP in all failure cases. > > > What failure case did not deallocate the allocated received SMP ?
Don't recall exactly. Might have been if process_mad consumed the MAD, which I guess isn't a failure case. I can regenerate a new patch.
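(For background on the OUI demux question above: in the IBA 1.1 vendor classes 0x30-0x4f the OUI is carried inside the MAD itself, roughly as in the sketch below. The struct and field names are ours, not from the OpenIB tree, and wire byte order is glossed over.)

#include <linux/types.h>
#include <linux/string.h>

struct ib_vendor_mad {
	/* common MAD header (24 bytes) */
	u8	base_version;
	u8	mgmt_class;	/* 0x30 - 0x4f for OUI-based classes */
	u8	class_version;
	u8	method;
	u16	status;		/* big endian on the wire */
	u16	class_specific;
	u64	tid;
	u16	attr_id;
	u16	resv;
	u32	attr_mod;
	/* class 0x30-0x4f extension */
	u8	rmpp[12];	/* RMPP header */
	u8	reserved;
	u8	oui[3];		/* IEEE OUI of the defining vendor */
	u8	data[216];
} __attribute__ ((packed));

/* Demux sketch: registration would match on (class, OUI), not class alone. */
static inline int ib_vendor_mad_matches(const struct ib_vendor_mad *mad,
					u8 mgmt_class, const u8 oui[3])
{
	return mad->mgmt_class == mgmt_class && !memcmp(mad->oui, oui, 3);
}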
From mshefty at ichips.intel.com Mon Nov 29 11:40:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:40:54 -0800 Subject: [openib-general] [PATCH] [re-send] cleanup/fixes for handle_outgoing_smp Message-ID: <20041129114054.2afd03b4.mshefty@ichips.intel.com> Index: core/mad.c =================================================================== --- core/mad.c (revision 1291) +++ core/mad.c (working copy) @@ -366,108 +366,93 @@ struct ib_send_wr *send_wr) { int ret; + struct ib_mad_private *mad_priv; + struct ib_mad_send_wc mad_send_wc; if (!smi_handle_dr_smp_send(smp, mad_agent->device->node_type, mad_agent->port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); - goto error1; + goto out; } - if (smi_check_local_dr_smp(smp, - mad_agent->device, - mad_agent->port_num)) { - struct ib_mad_private *mad_priv; - struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_send_wc mad_send_wc; - - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - ret = -ENOMEM; - printk(KERN_ERR PFX "No memory for local " - "response MAD\n"); - goto error1; - } + /* Check to post send on QP or process locally. */ + ret = smi_check_local_dr_smp(smp, mad_agent->device, + mad_agent->port_num); + if (!ret || !mad_agent->device->process_mad) + goto out; - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - - if (mad_agent->device->process_mad) { - ret = mad_agent->device->process_mad( - mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto out; + } + ret = mad_agent->device->process_mad(mad_agent->device, 0, + mad_agent->port_num, smp->dr_slid, (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (ret & IB_MAD_RESULT_CONSUMED) { - ret = 1; - goto error1; - } - if (ret & IB_MAD_RESULT_REPLY) { - /* - * See if response is solicited and - * there is a recv handler - */ - if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; - - /* - * Defined behavior is to - * complete response before - * request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = - sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = - &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler( - mad_agent, - &mad_priv->header.recv_wc); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } - - if (mad_agent_priv->agent.send_handler) { - /* Now, complete send */ - mad_send_wc.status = IB_WC_SUCCESS; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = send_wr->wr_id; - mad_agent_priv->agent.send_handler( - mad_agent, - &mad_send_wc); - ret = 1; + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent->recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent->recv_handler(mad_agent, + &mad_priv->header.recv_wc); } else - ret = -EINVAL; - } else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: ret = 0; + kmem_cache_free(ib_mad_cache, mad_priv); + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = -EINVAL; + goto out; + } -error1: + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent->send_handler(mad_agent, &mad_send_wc); + ret = 1; +out: return ret; } From mshefty at ichips.intel.com Mon Nov 29 11:48:57 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:48:57 -0800 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <1101756206.4145.265.camel@localhost.localdomain> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> <41AB7642.2050803@ichips.intel.com> <1101756206.4145.265.camel@localhost.localdomain> Message-ID: <41AB7D29.409@ichips.intel.com> Hal Rosenstock wrote: > Also, based on this, do you think it makes sense for an OpenIB OUI (if > we are to utilize these classes) ? I think that it makes sense, but I'd wait until we actually have code that utilizes it. - Sean From mst at mellanox.co.il Mon Nov 29 11:55:15 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 29 Nov 2004 21:55:15 +0200 Subject: [openib-general] struct class and device.c In-Reply-To: <52k6s4sbkd.fsf@topspin.com> References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> <20041129191751.GA3450@mellanox.co.il> <52k6s4sbkd.fsf@topspin.com> Message-ID: <20041129195515.GA3514@mellanox.co.il>
Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] struct class and device.c": > Michael> No, I had in mind using the children list in the class > Michael> instead of the device_list in device.c and interfaces > Michael> list instead of the client_list. Why not? Wouldn't that > Michael> work? > > I guess it would work but it seems ugly to rely on the internals of > struct class. Although drivers/ieee1394 does do exactly this (without > any locking)....
Yes. Further, that's just for the alloc_name hacks. I am not sure why that is useful - this creates arbitrary names like mthca0 which are not really useful to identify the devices. Couldn't, say, PCI bus names be used?
From mst at mellanox.co.il Mon Nov 29 12:02:07 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 22:02:07 +0200 Subject: [openib-general] struct class and device.c In-Reply-To: <20041129195515.GA3514@mellanox.co.il> References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> <20041129191751.GA3450@mellanox.co.il> <52k6s4sbkd.fsf@topspin.com> <20041129195515.GA3514@mellanox.co.il> Message-ID: <20041129200207.GB3514@mellanox.co.il>
Hello! Sorry about replying to myself ... Quoting Michael S. Tsirkin (mst at mellanox.co.il) "Re: [openib-general] struct class and device.c": > Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] struct class and device.c": > > Michael> No, I had in mind using the children list in the class > > Michael> instead of the device_list in device.c and interfaces > > Michael> list instead of the client_list. Why not? Wouldn't that > > Michael> work? > > > > I guess it would work but it seems ugly to rely on the internals of > > struct class. Although drivers/ieee1394 does do exactly this (without > > any locking).... > > Yes.
I mean, duplicating the code from drivers/base/ is ugly too.
> Further, that's just for the alloc_name hacks. I am not sure why > that is useful - this creates arbitrary names like mthca0 > which are not really useful to identify the devices. > Couldn't, say, PCI bus names be used?
Or use the system guid. As it is you can pull a device out and another device will be renamed. mst
From halr at voltaire.com Mon Nov 29 12:06:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 15:06:31 -0500 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <41AB7D29.409@ichips.intel.com> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> <41AB7642.2050803@ichips.intel.com> <1101756206.4145.265.camel@localhost.localdomain> <41AB7D29.409@ichips.intel.com> Message-ID: <1101758791.4145.268.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:48, Sean Hefty wrote: > I think that it makes sense, but I'd wait until we actually have code > that utilizes it.
An initial proposal for diagnostics will be posted in the next day or so. In it, there is an ibping utility. It is currently defined as having two ways of running it: with vendor MADs and with normal UD transport. That is the first use (if it meets with consensus).
-- Hal
From halr at voltaire.com Mon Nov 29 12:45:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 15:45:08 -0500 Subject: [openib-general] Re: [PATCH] [re-send] cleanup/fixes for handle_outgoing_smp In-Reply-To: <20041129114054.2afd03b4.mshefty@ichips.intel.com> References: <20041129114054.2afd03b4.mshefty@ichips.intel.com> Message-ID: <1101761108.4145.274.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:40, Sean Hefty wrote: > - if (mad_agent_priv->agent.send_handler) { > - /* Now, complete send */ > - mad_send_wc.status = IB_WC_SUCCESS; > - mad_send_wc.vendor_err = 0; > - mad_send_wc.wr_id = send_wr->wr_id; > - mad_agent_priv->agent.send_handler( > - mad_agent, > - &mad_send_wc); > + /* Complete send */ > + mad_send_wc.status = IB_WC_SUCCESS; > + mad_send_wc.vendor_err = 0; > + mad_send_wc.wr_id = send_wr->wr_id; > + mad_agent->send_handler(mad_agent, &mad_send_wc); > + ret = 1;
Currently, it isn't safe to eliminate the check for the send_handler. (The registration code does not guarantee that a send_handler was supplied; it only does so in the case where no registration request is supplied with the registration). Should a send handler always be required or should this check be added back in ? -- Hal
From mshefty at ichips.intel.com Mon Nov 29 13:24:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 13:24:31 -0800 Subject: [openib-general] Re: [PATCH] [re-send] cleanup/fixes for handle_outgoing_smp In-Reply-To: <1101761108.4145.274.camel@localhost.localdomain> References: <20041129114054.2afd03b4.mshefty@ichips.intel.com> <1101761108.4145.274.camel@localhost.localdomain> Message-ID: <41AB938F.6020708@ichips.intel.com>
Hal Rosenstock wrote: > On Mon, 2004-11-29 at 14:40, Sean Hefty wrote: > > >>- if (mad_agent_priv->agent.send_handler) { >>- /* Now, complete send */ >>- mad_send_wc.status = IB_WC_SUCCESS; >>- mad_send_wc.vendor_err = 0; >>- mad_send_wc.wr_id = send_wr->wr_id; >>- mad_agent_priv->agent.send_handler( >>- mad_agent, >>- &mad_send_wc); > > >>+ /* Complete send */ >>+ mad_send_wc.status = IB_WC_SUCCESS; >>+ mad_send_wc.vendor_err = 0; >>+ mad_send_wc.wr_id = send_wr->wr_id; >>+ mad_agent->send_handler(mad_agent, &mad_send_wc); >>+ ret = 1; > > > Currently, it isn't safe to eliminate the check for the send_handler. > (The registration code does not guarantee that a send_handler was > supplied; it only does so in the case where no registration request is > supplied with the registration). > > Should a send handler always be required or should this check be added > back in ?
The send_handler is checked in ib_post_send_mad.
From gdror at mellanox.co.il Tue Nov 30 00:59:24 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Tue, 30 Nov 2004 10:59:24 +0200 Subject: [openib-general] Unicast ARP Message-ID: <506C3D7B14CDD411A52C00025558DED606933619@mtlex01.yok.mtl.com>
> -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 9:04 PM > > > Michael> Currently it also seems that by just bringing the > Michael> interface up and down the hw address will change. Unless > Michael> I am doing something wrong, this inconvenience seems to > Michael> be caused by a different QP number being assigned. Can't > Michael> this be solved e.g. by assigning a fixed QP number for IP > Michael> over IB? > > It seems you are doing something wrong.
The QP is allocated > when the interface is created and remains the same when the > interface is brought up and down: > > # ip addr show dev ib0 > 7: ib0: mtu 2044 qdisc > pfifo_fast qlen 128 > link/[32] > 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 > brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff > # ifconfig ib0 down > # ifconfig ib0 up > # ip addr show dev ib0 > 7: ib0: mtu 2044 qdisc > pfifo_fast qlen 128 > link/[32] > 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 > brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff >
Roland, You're right, it takes more than "ifconfig down" for the QPN to change. If you take the module down after doing ifconfig down, then the QPN may change. Assigning a specific QPN for ipoib requires allocation of QPN space, which is beyond the IB spec verbs. Current verbs do not allow it. I don't have any objection to that, except that you have to hold a set of preallocated QPs with specific numbers and hand them over to a privileged consumer when requested to. I wouldn't commit that it will work on any HCA architecture. -Dror
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From mst at mellanox.co.il Tue Nov 30 01:05:05 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Nov 2004 11:05:05 +0200 Subject: [openib-general] Unicast ARP In-Reply-To: <1101755782.4145.256.camel@localhost.localdomain> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> <20041129185739.GA3394@mellanox.co.il> <52oehgsci9.fsf@topspin.com> <1101755782.4145.256.camel@localhost.localdomain> Message-ID: <20041130090505.GA11212@mellanox.co.il>
Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] Unicast ARP": > On Mon, 2004-11-29 at 14:04, Roland Dreier wrote: > > Michael> Currently it also seems that by just bringing the > > Michael> interface up and down the hw address will change. Unless > > Michael> I am doing something wrong, this inconvenience seems to > > Michael> be caused by a different QP number being assigned. Can't > > Michael> this be solved e.g. by assigning a fixed QP number for IP > > Michael> over IB? > > > > It seems you are doing something wrong. The QP is allocated when the > > interface is created and remains the same when the interface is > > brought up and down: > > The only time I see that is when removing and re-adding the IPoIB module.
That was it. My script was unloading the module. Thanks. MST
From halr at voltaire.com Tue Nov 30 08:34:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 30 Nov 2004 11:34:11 -0500 Subject: [openib-general] [RFC] Proposed OpenIB Diagnostic Tools Message-ID: <1101832451.6411.89.camel@localhost.localdomain>
Hi, Attached is an initial proposal for diagnostic tools. They fall into two categories: host and network. Applications (or scripts) and library support would be supplied. This is an initial writeup on the high level descriptions of the tools and an initial syntax. All comments welcome. BTW, are there coding guidelines for user space ? Note that the implementation of these tools will take a back seat to getting the OpenSM up and running with gen2.
At some point soon, I will put a copy of this in the gen2 tree perhaps at /gen2/trunk/src/userspace/diags -- Hal -------------- next part -------------- Diagnostic Tools 11/29/04 user space applications (also library support) two categories: host and network Host Oriented Diagnostic Tools 1. ibstatus Description: ibstatus displays basic information obtained from the local IB driver. -v enables verbose mode. Normal output includes LID, SMLID, port state, link width active, and port physical state. Verbose includes all sysfs supported parameters for that interface and port. Syntax: ibstatus [-v] [-I mthca0] [-p port] Dependencies: sysfs support in mthca 2. ibroute Description: ibroute uses SMPs to display the forwarding tables (unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT)) for the specified LID. Syntax: ibroute [-multi] [-m mkey] [-pa path] [-I mthca0] [-p port] LID Dependencies: user MAD access, SMA 3. ibtracert Description: ibtracert uses SMPs to trace the path from a source GID/LID to a destination GID/LID. The source GID/LID must be local to the node. Each hop along the path is displayed until the destination is reached or a hop does not respond. By using -mg and/or -ml options, multicast path tracing can be performed between source and destination nodes. Syntax: ibtracert [-m mkey] [-pa path] [-sg SGID] [-sl SLID] [-dg DGID] [-dl DLID] \ [-mg MGID] [-ml MLID] [-I mthca0] [-p port] Dependencies: user MAD access, SMA 4. smpquery Description: smpquery allows a basic subset of standard SMP queries including the following: local information (LID, GID, etc.), node information (from NodeDescription, NodeInfo, and possibly SwitchInfo if node is a switch), port information (port address and state), and port parameters (SLtoVLMappingTable, VLArbitrationTable, HOQLife, ...). Syntax: smpquery [-m mkey] [-l LID] [-pa path] [-I mthca0] [-p port] \ [-l] [-n] [-pi] [-pp] Dependencies: User MAD access 5. smpdump Description: smpdump is a general purpose SMP utility which gets SM attributes from a specified SMA. The result is dumped as hex (-x) or string (-s), with hex as the default. Syntax: smpdump [-m mkey] [-l LID] [-p path] [-I mthca0] [-p port] \ [-a attributeID] [-am attributeModifier] [-s] [-x] Dependencies: User MAD access 6. perfquery Description: perfquery uses PerfMgt GMPs to obtain the PortCounters (basic performance and error counters) from the PMA at the node specified. -r resets these counters after obtaining them. Syntax: perfquery [-I mthca0] [-p port] [-r] [-g GID] LID Dependencies: User MAD access, PMA 7. ibping Description: ibping uses UD transport to validate connectivity between IB nodes. It is run as client/server (daemon). -v option uses vendor MADs rather than normal UD transport. Syntax: ibping [-d] [-v] [-c count] [-i interval] [-s packetsize] \ [-I mthca0] [-p port] [-q qkey] [-g DGID] [-qp dqp] [-dl DLID] -d: run as daemon (server) Dependencies: user MAD access Network Oriented Diagnostics 8. ibnetdiscover Description: ibnetdiscover performs IB subnet discovery and outputs a human readable topology file. GUIDs, node types, and port numbers are displayed as well as port LIDs and NodeDescriptions. All nodes (and links) are displayed (full topology). 
Syntax: ibnetdiscover [-I mthca0] [-p port] [-o topology-filename] Dependencies: user MAD access Future versions of this file will be annotated with additional information including system guid, system type, internal to physical mapping, and physical location information (blade or ASIC number, etc.).
9. ibhosts Description: ibhosts either walks the IB subnet topology or uses an already saved topology file and extracts the HCA nodes. Syntax: ibhosts [-I mthca0] [-p port] [-i topology-filename] [-o ibhosts-filename] Dependencies: user MAD access, ibnetdiscover
10. ibswitches Description: ibswitches either walks the IB subnet topology or uses an already saved topology file and extracts the IB switches. Syntax: ibswitches [-I mthca0] [-p port] \ [-i topology-filename] [-o ibswitches-filename] Dependencies: user MAD access, ibnetdiscover
11. ibnetverify Description: ibnetverify uses a full topology file that was created by ibnetdiscover, scans the network to see whether the current topology matches, displaying any discrepancies, and validates the connectivity and reports errors (from port counters). Syntax: ibnetverify -f filename [-I mthca0] [-p port] Dependencies: user MAD access, ibnetdiscover
From philippe.gregoire at cea.fr Tue Nov 30 09:18:19 2004 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Tue, 30 Nov 2004 18:18:19 +0100 Subject: [openib-general] testing OpenIB on TopSpin hardware ? Message-ID: <200411301718.SAA08841@styx.bruyeres.cea.fr>
Hello, I would like to test the latest OpenIB software on our test platform, especially the SDP part. We have an HP-DL380/DL360 (IA32) cluster with 12 nodes connected through TopSpin HCAs and a TopSpin 90 IB switch. I got the OpenIB source with svn. What is the latest version, gen1 or 1.0? 1.0 looks like the version available in March, correct? What is the firmware requirement for the HCA and the switch ? Thanks for your help Philippe Gregoire CEA/DAM
From halr at voltaire.com Tue Nov 30 09:55:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 30 Nov 2004 12:55:31 -0500 Subject: [openib-general] [RFC] Diagnostic Tools Proposal Message-ID: <1101837331.6411.266.camel@localhost.localdomain>
is now located in the tree as: https://openib.org/svn/gen2/trunk/src/userspace/diags/diagtools-proposal.txt
From halr at voltaire.com Tue Nov 30 10:15:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 30 Nov 2004 13:15:48 -0500 Subject: [openib-general] smpdump and current MAD layer Message-ID: <1101838548.6411.276.camel@localhost.localdomain>
Hi, I believe there is an issue with smpdump (or gmpdump) and just want to make sure I am not forgetting something as I am wont to do :-) Each received MAD can only have 1 client which "owns" it. That client is either determined via solicited routing or version/class/method (and soon OUI) routing. So solicited MAD responses cannot currently be snooped, nor can unsolicited ones for which an agent is registered (since SMA and PMA are currently firmware-based, the latter is not an issue for the current implementation). Is the above correct ? If so, do you see a "clean" way around this ? Thanks. -- Hal
From mshefty at ichips.intel.com Tue Nov 30 10:40:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Nov 2004 10:40:30 -0800 Subject: [openib-general] [PATCH] added documentation for exported functions Message-ID: <20041130104030.274312d0.mshefty@ichips.intel.com>
Patch adds documentation for exported functions that did not have it in ib_verbs.h and device.c.
From philippe.gregoire at cea.fr Tue Nov 30 09:18:19 2004
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Tue, 30 Nov 2004 18:18:19 +0100
Subject: [openib-general] testing OpenIB on TopSpin hardware ?
Message-ID: <200411301718.SAA08841@styx.bruyeres.cea.fr>

Hello,

I would like to test the latest OpenIB software on our test platform,
especially the SDP part. We have an HP DL380/DL360 (IA32) cluster with 12
nodes connected through TopSpin HCAs and a TopSpin 90 IB switch.

I got the OpenIB source with svn. What is the latest version, gen1 or 1.0 ?
1.0 looks like the version available in March, correct ?

What are the firmware requirements for the HCA and the switch ?

Thanks for your help,
Philippe Gregoire
CEA/DAM

From halr at voltaire.com Tue Nov 30 09:55:31 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 12:55:31 -0500
Subject: [openib-general] [RFC] Diagnostic Tools Proposal
Message-ID: <1101837331.6411.266.camel@localhost.localdomain>

The diagnostic tools proposal is now located in the tree as:
https://openib.org/svn/gen2/trunk/src/userspace/diags/diagtools-proposal.txt

From halr at voltaire.com Tue Nov 30 10:15:48 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 13:15:48 -0500
Subject: [openib-general] smpdump and current MAD layer
Message-ID: <1101838548.6411.276.camel@localhost.localdomain>

Hi,

I believe there is an issue with smpdump (or gmpdump) and just want to make
sure I am not forgetting something, as I am wont to do :-)

Each received MAD can only have one client which "owns" it. That client is
either determined via solicited routing or version/class/method (and soon
OUI) routing. So solicited MAD responses cannot currently be snooped, nor
can unsolicited ones for which an agent is registered (since SMA and PMA
are currently firmware based, the latter is not an issue for the current
implementation).

Is the above correct ? If so, do you see a "clean" way around this ?

Thanks.

-- Hal

From mshefty at ichips.intel.com Tue Nov 30 10:40:30 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 10:40:30 -0800
Subject: [openib-general] [PATCH] added documentation for exported functions
Message-ID: <20041130104030.274312d0.mshefty@ichips.intel.com>

This patch adds documentation for the exported functions that did not have
it in ib_verbs.h and device.c, and fixes a slight formatting issue in the
ib_mad.h documentation. The patch will be committed shortly after sending
this.

- Sean

Index: include/ib_verbs.h
===================================================================
--- include/ib_verbs.h	(revision 1302)
+++ include/ib_verbs.h	(working copy)
@@ -849,28 +849,107 @@
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify);
 
+/**
+ * ib_alloc_pd - Allocates an unused protection domain.
+ * @device: The device on which to allocate the protection domain.
+ *
+ * A protection domain object provides an association between QPs, shared
+ * receive queues, address handles, memory regions, and memory windows.
+ */
 struct ib_pd *ib_alloc_pd(struct ib_device *device);
+
+/**
+ * ib_dealloc_pd - Deallocates a protection domain.
+ * @pd: The protection domain to deallocate.
+ */
 int ib_dealloc_pd(struct ib_pd *pd);
 
+/**
+ * ib_create_ah - Creates an address handle for the given address vector.
+ * @pd: The protection domain associated with the address handle.
+ * @ah_attr: The attributes of the address vector.
+ *
+ * The address handle is used to reference a local or global destination
+ * in all UD QP post sends.
+ */
 struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_modify_ah - Modifies the address vector associated with an address
+ * handle.
+ * @ah: The address handle to modify.
+ * @ah_attr: The new address vector attributes to associate with the
+ * address handle.
+ */
int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_query_ah - Queries the address vector associated with an address
+ * handle.
+ * @ah: The address handle to query.
+ * @ah_attr: The address vector attributes associated with the address
+ * handle.
+ */
 int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_destroy_ah - Destroys an address handle.
+ * @ah: The address handle to destroy.
+ */
 int ib_destroy_ah(struct ib_ah *ah);
 
+/**
+ * ib_create_qp - Creates a QP associated with the specified protection
+ * domain.
+ * @pd: The protection domain associated with the QP.
+ * @qp_init_attr: A list of initial attributes required to create the QP.
+ */
 struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr);
 
+/**
+ * ib_modify_qp - Modifies the attributes for the specified QP and then
+ * transitions the QP to the given state.
+ * @qp: The QP to modify.
+ * @qp_attr: On input, specifies the QP attributes to modify. On output,
+ * the current values of selected QP attributes are returned.
+ * @qp_attr_mask: A bit-mask used to specify which attributes of the QP
+ * are being modified.
+ */
 int ib_modify_qp(struct ib_qp *qp,
 		 struct ib_qp_attr *qp_attr,
 		 int qp_attr_mask);
 
+/**
+ * ib_query_qp - Returns the attribute list and current values for the
+ * specified QP.
+ * @qp: The QP to query.
+ * @qp_attr: The attributes of the specified QP.
+ * @qp_attr_mask: A bit-mask used to select specific attributes to query.
+ * @qp_init_attr: Additional attributes of the selected QP.
+ *
+ * The qp_attr_mask may be used to limit the query to gathering only the
+ * selected attributes.
+ */
 int ib_query_qp(struct ib_qp *qp,
 		struct ib_qp_attr *qp_attr,
 		int qp_attr_mask,
 		struct ib_qp_init_attr *qp_init_attr);
 
+/**
+ * ib_destroy_qp - Destroys the specified QP.
+ * @qp: The QP to destroy.
+ */
 int ib_destroy_qp(struct ib_qp *qp);
 
+/**
+ * ib_post_send - Posts a list of work requests to the send queue of
+ * the specified QP.
+ * @qp: The QP to post the work request on.
+ * @send_wr: A list of work requests to post on the send queue.
+ * @bad_send_wr: On an immediate failure, this parameter will reference
+ * the work request that failed to be posted on the QP.
+ */
 static inline int ib_post_send(struct ib_qp *qp,
 			       struct ib_send_wr *send_wr,
 			       struct ib_send_wr **bad_send_wr)
@@ -878,6 +957,14 @@
 	return qp->device->post_send(qp, send_wr, bad_send_wr);
 }
 
+/**
+ * ib_post_recv - Posts a list of work requests to the receive queue of
+ * the specified QP.
+ * @qp: The QP to post the work request on.
+ * @recv_wr: A list of work requests to post on the receive queue.
+ * @bad_recv_wr: On an immediate failure, this parameter will reference
+ * the work request that failed to be posted on the QP.
+ */
 static inline int ib_post_recv(struct ib_qp *qp,
 			       struct ib_recv_wr *recv_wr,
 			       struct ib_recv_wr **bad_recv_wr)
@@ -885,12 +972,37 @@
 	return qp->device->post_recv(qp, recv_wr, bad_recv_wr);
 }
 
+/**
+ * ib_create_cq - Creates a CQ on the specified device.
+ * @device: The device on which to create the CQ.
+ * @comp_handler: A user-specified callback that is invoked when a
+ * completion event occurs on the CQ.
+ * @event_handler: A user-specified callback that is invoked when an
+ * asynchronous event not associated with a completion occurs on the CQ.
+ * @cq_context: Context associated with the CQ returned to the user via
+ * the associated completion and event handlers.
+ * @cqe: The minimum size of the CQ.
+ *
+ * Users can examine the cq structure to determine the actual CQ size.
+ */
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
 			   void *cq_context, int cqe);
+
+/**
+ * ib_resize_cq - Modifies the capacity of the CQ.
+ * @cq: The CQ to resize.
+ * @cqe: The minimum size of the CQ.
+ *
+ * Users can examine the cq structure to determine the actual CQ size.
+ */
 int ib_resize_cq(struct ib_cq *cq, int cqe);
+
+/**
+ * ib_destroy_cq - Destroys the specified CQ.
+ * @cq: The CQ to destroy.
+ */
 int ib_destroy_cq(struct ib_cq *cq);
 
 /**
@@ -911,13 +1023,24 @@
 	return cq->device->poll_cq(cq, num_entries, wc);
 }
 
+/**
+ * ib_peek_cq - Returns the number of unreaped completions currently
+ * on the specified CQ.
+ * @cq: The CQ to peek.
+ * @wc_cnt: A minimum number of unreaped completions to check for.
+ *
+ * If the number of unreaped completions is greater than or equal to wc_cnt,
+ * this function returns wc_cnt, otherwise, it returns the actual number of
+ * unreaped completions.
+ */
 int ib_peek_cq(struct ib_cq *cq, int wc_cnt);
 
 /**
- * ib_req_notify_cq - request completion notification
- * @cq:the CQ to generate an event for
- * @cq_notify:%IB_CQ_SOLICITED for next solicited event,
- * %IB_CQ_NEXT_COMP for any completion.
+ * ib_req_notify_cq - Request completion notification on a CQ.
+ * @cq: The CQ to generate an event for.
+ * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will
+ * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP,
+ * notification will occur on the next completion.
  */
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
@@ -925,6 +1048,13 @@
 	return cq->device->req_notify_cq(cq, cq_notify);
 }
 
+/**
+ * ib_req_ncomp_notif - Request completion notification when there are
+ * at least the specified number of unreaped completions on the CQ.
+ * @cq: The CQ to generate an event for.
+ * @wc_cnt: The number of unreaped completions that should be on the
+ * CQ before an event is generated.
+ */
 static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
 {
 	return cq->device->req_ncomp_notif ?
@@ -932,14 +1062,52 @@
 		-ENOSYS;
 }
 
+/**
+ * ib_get_dma_mr - Returns a memory region for system memory that is
+ * usable for DMA.
+ * @pd: The protection domain associated with the memory region.
+ * @mr_access_flags: Specifies the memory access rights.
+ */
 struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
 
+/**
+ * ib_reg_phys_mr - Prepares a virtually addressed memory region for use
+ * by an HCA.
+ * @pd: The protection domain assigned to the registered region.
+ * @phys_buf_array: Specifies a list of physical buffers to use in the
+ * memory region.
+ * @num_phys_buf: Specifies the size of the phys_buf_array.
+ * @mr_access_flags: Specifies the memory access rights.
+ * @iova_start: The offset of the region's starting I/O virtual address.
+ */
 struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
 			     struct ib_phys_buf *phys_buf_array,
 			     int num_phys_buf,
 			     int mr_access_flags,
 			     u64 *iova_start);
 
+/**
+ * ib_rereg_phys_mr - Modifies the attributes of an existing memory region.
+ * Conceptually, this call performs the functions of deregister memory
+ * region followed by register physical memory region. Where possible,
+ * resources are reused instead of deallocated and reallocated.
+ * @mr: The memory region to modify.
+ * @mr_rereg_mask: A bit-mask used to indicate which of the following
+ * properties of the memory region are being modified.
+ * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies
+ * the new protection domain to associate with the memory region,
+ * otherwise, this parameter is ignored.
+ * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this
+ * field specifies a list of physical buffers to use in the new
+ * translation, otherwise, this parameter is ignored.
+ * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this
+ * field specifies the size of the phys_buf_array, otherwise, this
+ * parameter is ignored.
+ * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this
+ * field specifies the new memory access rights, otherwise, this
+ * parameter is ignored.
+ * @iova_start: The offset of the region's starting I/O virtual address.
+ */
 int ib_rereg_phys_mr(struct ib_mr *mr,
 		     int mr_rereg_mask,
 		     struct ib_pd *pd,
@@ -948,11 +1116,35 @@
 		     int mr_access_flags,
 		     u64 *iova_start);
 
+/**
+ * ib_query_mr - Retrieves information about a specific memory region.
+ * @mr: The memory region to retrieve information about.
+ * @mr_attr: The attributes of the specified memory region.
+ */
 int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr);
+
+/**
+ * ib_dereg_mr - Deregisters a memory region and removes it from the
+ * HCA translation table.
+ * @mr: The memory region to deregister.
+ */
 int ib_dereg_mr(struct ib_mr *mr);
 
+/**
+ * ib_alloc_mw - Allocates a memory window.
+ * @pd: The protection domain associated with the memory window.
+ */
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd);
 
+/**
+ * ib_bind_mw - Posts a work request to the send queue of the specified
+ * QP, which binds the memory window to the given address range and
+ * remote access attributes.
+ * @qp: QP to post the bind work request on.
+ * @mw: The memory window to bind.
+ * @mw_bind: Specifies information about the memory window, including
+ * its address range, remote access rights, and associated memory region.
+ */
 static inline int ib_bind_mw(struct ib_qp *qp,
 			     struct ib_mw *mw,
 			     struct ib_mw_bind *mw_bind)
@@ -963,12 +1155,32 @@
 		-ENOSYS;
 }
 
+/**
+ * ib_dealloc_mw - Deallocates a memory window.
+ * @mw: The memory window to deallocate.
+ */
 int ib_dealloc_mw(struct ib_mw *mw);
 
+/**
+ * ib_alloc_fmr - Allocates an unmapped fast memory region.
+ * @pd: The protection domain associated with the unmapped region.
+ * @mr_access_flags: Specifies the memory access rights.
+ * @fmr_attr: Attributes of the unmapped region.
+ *
+ * A fast memory region must be mapped before it can be used as part of
+ * a work request.
+ */
 struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd,
 			    int mr_access_flags,
 			    struct ib_fmr_attr *fmr_attr);
 
+/**
+ * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region.
+ * @fmr: The fast memory region to associate with the pages.
+ * @page_list: An array of physical pages to map to the fast memory region.
+ * @list_len: The number of pages in page_list.
+ * @iova: The I/O virtual address to use with the mapped region.
+ */
 static inline int ib_map_phys_fmr(struct ib_fmr *fmr,
 				  u64 *page_list, int list_len,
 				  u64 iova)
@@ -976,10 +1188,38 @@
 	return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova);
 }
 
+/**
+ * ib_unmap_fmr - Removes the mapping from a list of fast memory regions.
+ * @fmr_list: A linked list of fast memory regions to unmap.
+ */
 int ib_unmap_fmr(struct list_head *fmr_list);
+
+/**
+ * ib_dealloc_fmr - Deallocates a fast memory region.
+ * @fmr: The fast memory region to deallocate.
+ */
 int ib_dealloc_fmr(struct ib_fmr *fmr);
 
+/**
+ * ib_attach_mcast - Attaches the specified QP to a multicast group.
+ * @qp: QP to attach to the multicast group. The QP must be type
+ * IB_QPT_UD.
+ * @gid: Multicast group GID.
+ * @lid: Multicast group LID in host byte order.
+ *
+ * In order to send and receive multicast packets, subnet
+ * administration must have created the multicast group and configured
+ * the fabric appropriately. The port associated with the specified
+ * QP must also be a member of the multicast group.
+ */
 int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
+
+/**
+ * ib_detach_mcast - Detaches the specified QP from a multicast group.
+ * @qp: QP to detach from the multicast group.
+ * @gid: Multicast group GID.
+ * @lid: Multicast group LID in host byte order.
+ */
 int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 
 #endif /* IB_VERBS_H */

Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h	(revision 1302)
+++ include/ib_mad.h	(working copy)
@@ -110,16 +110,16 @@
 /**
  * ib_mad_send_handler - callback handler for a sent MAD.
- * @mad_agent - MAD agent that sent the MAD.
- * @mad_send_wc - Send work completion information on the sent MAD.
+ * @mad_agent: MAD agent that sent the MAD.
+ * @mad_send_wc: Send work completion information on the sent MAD.
  */
 typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent,
 				    struct ib_mad_send_wc *mad_send_wc);
 
 /**
  * ib_mad_recv_handler - callback handler for a received MAD.
- * @mad_agent - MAD agent requesting the received MAD.
- * @mad_recv_wc - Received work completion information on the received MAD.
+ * @mad_agent: MAD agent requesting the received MAD.
+ * @mad_recv_wc: Received work completion information on the received MAD.
  *
  * MADs received in response to a send request operation will be handed to
 * the user after the send operation completes. All data buffers given
@@ -130,15 +130,15 @@
 /**
  * ib_mad_agent - Used to track MAD registration with the access layer.
- * @device - Reference to device registration is on.
- * @qp - Reference to QP used for sending and receiving MADs.
- * @recv_handler - Callback handler for a received MAD.
- * @send_handler - Callback handler for a sent MAD.
- * @context - User-specified context associated with this registration.
- * @hi_tid - Access layer assigned transaction ID for this client.
+ * @device: Reference to device registration is on.
+ * @qp: Reference to QP used for sending and receiving MADs.
+ * @recv_handler: Callback handler for a received MAD.
+ * @send_handler: Callback handler for a sent MAD.
+ * @context: User-specified context associated with this registration.
+ * @hi_tid: Access layer assigned transaction ID for this client.
  * Unsolicited MADs sent by this client will have the upper 32-bits
  * of their TID set to this value.
- * @port_num - Port number on which QP is registered
+ * @port_num: Port number on which QP is registered
  */
 struct ib_mad_agent {
 	struct ib_device	*device;
@@ -152,9 +152,9 @@
 /**
  * ib_mad_send_wc - MAD send completion information.
- * @wr_id - Work request identifier associated with the send MAD request.
- * @status - Completion status.
- * @vendor_err - Optional vendor error information returned with a failed
+ * @wr_id: Work request identifier associated with the send MAD request.
+ * @status: Completion status.
+ * @vendor_err: Optional vendor error information returned with a failed
  * request.
  */
 struct ib_mad_send_wc {
@@ -165,11 +165,11 @@
 /**
  * ib_mad_recv_buf - received MAD buffer information.
- * @list - Reference to next data buffer for a received RMPP MAD.
- * @grh - References a data buffer containing the global route header.
+ * @list: Reference to next data buffer for a received RMPP MAD.
+ * @grh: References a data buffer containing the global route header.
  * The data referenced by this buffer is only valid if the GRH is
 * valid.
- * @mad - References the start of the received MAD.
+ * @mad: References the start of the received MAD.
  */
 struct ib_mad_recv_buf {
 	struct list_head	list;
@@ -179,9 +179,9 @@
 /**
  * ib_mad_recv_wc - received MAD information.
- * @wc - Completion information for the received data.
- * @recv_buf - Specifies the location of the received data buffer(s).
- * @mad_len - The length of the received MAD, without duplicated headers.
+ * @wc: Completion information for the received data.
+ * @recv_buf: Specifies the location of the received data buffer(s).
+ * @mad_len: The length of the received MAD, without duplicated headers.
  *
 * For a received response, the wr_id field of the wc is set to the wr_id
 * for the corresponding send request.
@@ -194,12 +194,12 @@
 /**
  * ib_mad_reg_req - MAD registration request
- * @mgmt_class - Indicates which management class of MADs should be receive
+ * @mgmt_class: Indicates which management class of MADs should be received
  * by the caller. This field is only required if the user wishes to
  * receive unsolicited MADs, otherwise it should be 0.
- * @mgmt_class_version - Indicates which version of MADs for the given
+ * @mgmt_class_version: Indicates which version of MADs for the given
  * management class to receive.
- * @method_mask - The caller will receive unsolicited MADs for any method
+ * @method_mask: The caller will receive unsolicited MADs for any method
 * where @method_mask = 1.
 */
 struct ib_mad_reg_req {
@@ -210,21 +210,21 @@
 /**
  * ib_register_mad_agent - Register to send/receive MADs.
- * @device - The device to register with.
- * @port_num - The port on the specified device to use.
- * @qp_type - Specifies which QP to access. Must be either
+ * @device: The device to register with.
+ * @port_num: The port on the specified device to use.
+ * @qp_type: Specifies which QP to access. Must be either
  * IB_QPT_SMI or IB_QPT_GSI.
- * @mad_reg_req - Specifies which unsolicited MADs should be received
+ * @mad_reg_req: Specifies which unsolicited MADs should be received
  * by the caller. This parameter may be NULL if the caller only
  * wishes to receive solicited responses.
- * @rmpp_version - If set, indicates that the client will send
+ * @rmpp_version: If set, indicates that the client will send
  * and receive MADs that contain the RMPP header for the given version.
  * If set to 0, indicates that RMPP is not used by this client.
- * @send_handler - The completion callback routine invoked after a send
+ * @send_handler: The completion callback routine invoked after a send
  * request has completed.
- * @recv_handler - The completion callback routine invoked for a received
+ * @recv_handler: The completion callback routine invoked for a received
  * MAD.
- * @context - User specified context associated with the registration.
+ * @context: User specified context associated with the registration.
  */
 struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 					   u8 port_num,
@@ -237,7 +237,7 @@
 /**
  * ib_unregister_mad_agent - Unregisters a client from using MAD services.
- * @mad_agent - Corresponding MAD registration request to deregister.
+ * @mad_agent: Corresponding MAD registration request to deregister.
  *
  * After invoking this routine, MAD services are no longer usable by the
  * client on the associated QP.
@@ -247,9 +247,9 @@
 /**
  * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated
  * with the registered client.
- * @mad_agent - Specifies the associated registration to post the send to.
- * @send_wr - Specifies the information needed to send the MAD(s).
- * @bad_send_wr - Specifies the MAD on which an error was encountered.
+ * @mad_agent: Specifies the associated registration to post the send to.
+ * @send_wr: Specifies the information needed to send the MAD(s).
+ * @bad_send_wr: Specifies the MAD on which an error was encountered.
  *
  * Sent MADs are not guaranteed to complete in the order that they were posted.
  */
@@ -259,8 +259,8 @@
 /**
  * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer.
- * @mad_recv_wc - Work completion information for a received MAD.
- * @buf - User-provided data buffer to receive the coalesced buffers. The
+ * @mad_recv_wc: Work completion information for a received MAD.
+ * @buf: User-provided data buffer to receive the coalesced buffers. The
  * referenced buffer should be at least the size of the mad_len specified
  * by @mad_recv_wc.
  *
@@ -273,7 +273,7 @@
 /**
  * ib_free_recv_mad - Returns data buffers used to receive a MAD to the
  * access layer.
- * @mad_recv_wc - Work completion information for a received MAD.
+ * @mad_recv_wc: Work completion information for a received MAD.
  *
  * Clients receiving MADs through their ib_mad_recv_handler must call this
  * routine to return the work completion buffers to the access layer.
@@ -282,8 +282,8 @@
 /**
  * ib_cancel_mad - Cancels an outstanding send MAD operation.
- * @mad_agent - Specifies the registration associated with sent MAD.
- * @wr_id - Indicates the work request identifier of the MAD to cancel.
+ * @mad_agent: Specifies the registration associated with sent MAD.
+ * @wr_id: Indicates the work request identifier of the MAD to cancel.
  *
  * MADs will be returned to the user through the corresponding
  * ib_mad_send_handler.
@@ -293,15 +293,15 @@
 /**
  * ib_redirect_mad_qp - Registers a QP for MAD services.
- * @qp - Reference to a QP that requires MAD services.
- * @rmpp_version - If set, indicates that the client will send
+ * @qp: Reference to a QP that requires MAD services.
+ * @rmpp_version: If set, indicates that the client will send
  * and receive MADs that contain the RMPP header for the given version.
  * If set to 0, indicates that RMPP is not used by this client.
- * @send_handler - The completion callback routine invoked after a send
+ * @send_handler: The completion callback routine invoked after a send
  * request has completed.
- * @recv_handler - The completion callback routine invoked for a received
+ * @recv_handler: The completion callback routine invoked for a received
  * MAD.
- * @context - User specified context associated with the registration.
+ * @context: User specified context associated with the registration.
  *
  * Use of this call allows clients to use MAD services, such as RMPP,
  * on user-owned QPs. After calling this routine, users may send
@@ -316,8 +316,8 @@
 /**
  * ib_process_mad_wc - Processes a work completion associated with a
  * MAD sent or received on a redirected QP.
- * @mad_agent - Specifies the registered MAD service using the redirected QP.
- * @wc - References a work completion associated with a sent or received
+ * @mad_agent: Specifies the registered MAD service using the redirected QP.
+ * @wc: References a work completion associated with a sent or received
  * MAD segment.
  *
  * This routine is used to complete or continue processing on a MAD request.

Index: core/device.c
===================================================================
--- core/device.c	(revision 1302)
+++ core/device.c	(working copy)
@@ -556,6 +556,17 @@
 }
 EXPORT_SYMBOL(ib_modify_device);
 
+/**
+ * ib_modify_port - Modifies the attributes for the specified port.
+ * @device: The device to modify.
+ * @port_num: The number of the port to modify.
+ * @port_modify_mask: Mask used to specify which attributes of the port
+ * to change.
+ * @port_modify: New attribute values for the port.
+ *
+ * ib_modify_port() changes a port's attributes as specified by the
+ * @port_modify_mask and @port_modify structure.
+ */
 int ib_modify_port(struct ib_device *device,
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify)
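To make the call flow documented by this patch concrete, here is a minimal
consumer sketch. It is not taken from the patch or the tree: it assumes the
usual in-kernel ERR_PTR error convention and the ib_qp_init_attr/ib_qp_cap
fields as documented above, with QP state transitions and work request
posting elided.

    #include <linux/err.h>
    #include <linux/string.h>
    #include <ib_verbs.h>

    /* Illustrative only: allocate a PD, a CQ, and a UD QP using the verbs
     * documented above, then tear everything down. */
    static int example_verbs_setup(struct ib_device *device)
    {
    	struct ib_qp_init_attr init_attr;
    	struct ib_pd *pd;
    	struct ib_cq *cq;
    	struct ib_qp *qp;
    	int ret = 0;

    	pd = ib_alloc_pd(device);
    	if (IS_ERR(pd))
    		return PTR_ERR(pd);

    	/* No handlers in this sketch; 32 is the minimum CQ size. */
    	cq = ib_create_cq(device, NULL, NULL, NULL, 32);
    	if (IS_ERR(cq)) {
    		ret = PTR_ERR(cq);
    		goto out_pd;
    	}

    	memset(&init_attr, 0, sizeof init_attr);
    	init_attr.send_cq          = cq;
    	init_attr.recv_cq          = cq;
    	init_attr.cap.max_send_wr  = 16;
    	init_attr.cap.max_recv_wr  = 16;
    	init_attr.cap.max_send_sge = 1;
    	init_attr.cap.max_recv_sge = 1;
    	init_attr.sq_sig_type      = IB_SIGNAL_ALL_WR;
    	init_attr.qp_type          = IB_QPT_UD;

    	qp = ib_create_qp(pd, &init_attr);
    	if (IS_ERR(qp)) {
    		ret = PTR_ERR(qp);
    		goto out_cq;
    	}

    	/* A real consumer would now transition the QP with ib_modify_qp()
    	 * and post work requests with ib_post_recv()/ib_post_send(). */

    	ib_destroy_qp(qp);
    out_cq:
    	ib_destroy_cq(cq);
    out_pd:
    	ib_dealloc_pd(pd);
    	return ret;
    }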
From mshefty at ichips.intel.com Tue Nov 30 10:55:47 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 10:55:47 -0800
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <1101838548.6411.276.camel@localhost.localdomain>
References: <1101838548.6411.276.camel@localhost.localdomain>
Message-ID: <41ACC233.2000201@ichips.intel.com>

Hal Rosenstock wrote:
> Each received MAD can only have one client which "owns" it. That client
> is either determined via solicited routing or version/class/method (and
> soon OUI) routing.

This is correct. This was done to avoid having to copy received MADs.

> So solicited MAD responses cannot currently be snooped, nor can
> unsolicited ones for which an agent is registered (since SMA and PMA
> are currently firmware based, the latter is not an issue for the
> current implementation).
>
> Is the above correct ? If so, do you see a "clean" way around this ?

This is something that was briefly discussed before. I think that I
would support snooping by extending the ib_mad_reg_req structure to
indicate a registration type, possibly along with some additional
filtering parameters. (We could also create a new snoop routine.)

One issue with snooping MADs is whether the snooping occurs above or
below RMPP, or possibly in both places.

- Sean

From halr at voltaire.com Tue Nov 30 11:33:19 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 14:33:19 -0500
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <41ACC233.2000201@ichips.intel.com>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
Message-ID: <1101843199.6411.288.camel@localhost.localdomain>

On Tue, 2004-11-30 at 13:55, Sean Hefty wrote:
> This is something that was briefly discussed before. I think that I
> would support snooping by extending the ib_mad_reg_req structure to
> indicate a registration type, possibly along with some additional
> filtering parameters. (We could also create a new snoop routine.)

Maybe a single bit field in the registration request to indicate snoop.

Another question is what granularity of snoop registration needs to be
supported. Is one SMP snooper and one GMP snooper sufficient ? Should
the snoopers be per class ? It seems to me that going down to the
method level is too much for snoopers. This is just another way of
expressing the filtering parameters you mention.

> One issue with snooping MADs is whether the snooping occurs above or
> below RMPP, or possibly in both places.

In general, I would think the GMP snooping would specify whether it is
to be done above or below RMPP, and perhaps the class or all GS classes
(some combinations wouldn't make sense). If one were having problems
with RMPP handling, I could see doing the snooping below RMPP;
otherwise, above (the normal case).

-- Hal
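To visualize the two options on the table at this point in the thread (a
snoop bit in the registration request, or a dedicated snoop call), here is a
purely hypothetical sketch; neither the field nor the routine below exists
in the tree, and the names are invented for illustration:

    /* Hypothetical only: a snoop bit added to a registration request,
     * per the "single bit field" idea above.  Renamed to _sketch to make
     * clear this is not the real ib_mad_reg_req. */
    struct ib_mad_reg_req_sketch {
    	u8	mgmt_class;
    	u8	mgmt_class_version;
    	/* method_mask bitmap as in the existing structure ... */
    	u8	snoop;	/* if set, receive copies of MADs owned by others */
    };

    /* ... or, as the alternative mentioned above, a dedicated snoop
     * registration call (also hypothetical): */
    struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
    					       u8 port_num,
    					       enum ib_qp_type qp_type,
    					       u8 rmpp_version,
    					       ib_mad_recv_handler recv_handler,
    					       void *context);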
From mshefty at ichips.intel.com Tue Nov 30 11:47:47 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 11:47:47 -0800
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <1101843199.6411.288.camel@localhost.localdomain>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
Message-ID: <41ACCE63.5050800@ichips.intel.com>

Hal Rosenstock wrote:
>> This is something that was briefly discussed before. I think that I
>> would support snooping by extending the ib_mad_reg_req structure to
>> indicate a registration type, possibly along with some additional
>> filtering parameters. (We could also create a new snoop routine.)
>
> Maybe a single bit field in the registration request to indicate snoop.
>
> Another question is what granularity of snoop registration needs to be
> supported. Is one SMP snooper and one GMP snooper sufficient ? Should
> the snoopers be per class ? It seems to me that going down to the
> method level is too much for snoopers. This is just another way of
> expressing the filtering parameters you mention.

I guess filtering can be done above the MAD layer, so just letting the
user specify the qp_type may be all that's needed, beyond indicating
that snooping is desired. If we go this route, we can probably support
any number of snoopers.

>> One issue with snooping MADs is whether the snooping occurs above or
>> below RMPP, or possibly in both places.
>
> In general, I would think the GMP snooping would specify whether it is
> to be done above or below RMPP, and perhaps the class or all GS classes
> (some combinations wouldn't make sense). If one were having problems
> with RMPP handling, I could see doing the snooping below RMPP;
> otherwise, above (the normal case).

Hmm... we could let the client decide through the rmpp_version
parameter. Also, would snooping include redirected QPs? I think that
we can support this.

- Sean

From halr at voltaire.com Tue Nov 30 12:24:59 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 15:24:59 -0500
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <41ACCE63.5050800@ichips.intel.com>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
	<41ACCE63.5050800@ichips.intel.com>
Message-ID: <1101846298.6411.351.camel@localhost.localdomain>

On Tue, 2004-11-30 at 14:47, Sean Hefty wrote:
> I guess filtering can be done above the MAD layer,

That seems like the right way to go to me.

> so just letting the
> user specify the qp_type may be all that's needed, beyond indicating
> that snooping is desired. If we go this route, we can probably support
> any number of snoopers.

Then there are really only 2 snoopers at the MAD level (SMI, GSI). Any
additional demux (snoopers) would be done above the MAD level.

> >> One issue with snooping MADs is whether the snooping occurs above or
> >> below RMPP, or possibly in both places.
> >
> > In general, I would think the GMP snooping would specify whether it is
> > to be done above or below RMPP, and perhaps the class or all GS classes
> > (some combinations wouldn't make sense). If one were having problems
> > with RMPP handling, I could see doing the snooping below RMPP;
> > otherwise, above (the normal case).
>
> Hmm... we could let the client decide through the rmpp_version
> parameter.

I like it. Nothing new is needed here. It's just a question of when to
implement it, above and below RMPP.

> Also, would snooping include redirected QPs? I think that
> we can support this.

Once a QP is redirected, is the MAD layer still involved in handing off
the receive completions for that QP ?

-- Hal

From halr at voltaire.com Tue Nov 30 14:46:28 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 17:46:28 -0500
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <41ACCE63.5050800@ichips.intel.com>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
	<41ACCE63.5050800@ichips.intel.com>
Message-ID: <1101854788.6411.373.camel@localhost.localdomain>

On Tue, 2004-11-30 at 14:47, Sean Hefty wrote:
> I guess filtering can be done above the MAD layer, so just letting the
> user specify the qp_type may be all that's needed, beyond indicating
> that snooping is desired. If we go this route, we can probably support
> any number of snoopers.

Does that mean the snoopers would just be a list based on qp_type (and
we have a list per QP type (SMI, GSI)) ?
-- Hal

From mshefty at ichips.intel.com Tue Nov 30 14:57:10 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 14:57:10 -0800
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <1101846298.6411.351.camel@localhost.localdomain>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
	<41ACCE63.5050800@ichips.intel.com>
	<1101846298.6411.351.camel@localhost.localdomain>
Message-ID: <41ACFAC6.1060804@ichips.intel.com>

Hal Rosenstock wrote:
>> I guess filtering can be done above the MAD layer,
>
> That seems like the right way to go to me.

Same here.

> Then there are really only 2 snoopers at the MAD level (SMI, GSI). Any
> additional demux (snoopers) would be done above the MAD level.

I was referring to allowing multiple clients to snoop QP0/1 traffic.
To implement this, it seems that we'd only need a single list per QP
per port.

>> Also, would snooping include redirected QPs? I think that
>> we can support this.
>
> Once a QP is redirected, is the MAD layer still involved in handing off
> the receive completions for that QP ?

Currently, the API expects the user to call ib_process_mad_wc for
*some* send or receive completions: those associated with RMPP and with
requests/responses. We can state that users of redirected QPs should
always call ib_process_mad_wc for any MAD related work completion, but
that isn't strictly enforceable as long as the user controls the CQ.
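As a rough sketch of the convention Sean describes for user-owned
(redirected) QPs, handing every MAD-related completion back through
ib_process_mad_wc, and assuming the agent/wc signature implied by the
documentation patch earlier in this thread:

    #include <ib_verbs.h>
    #include <ib_mad.h>

    /* Illustrative only: drain a CQ attached to a redirected QP, handing
     * each completion to the MAD layer as suggested above.  The agent is
     * assumed to come from ib_redirect_mad_qp(). */
    static void drain_redirected_cq(struct ib_mad_agent *agent,
    				    struct ib_cq *cq)
    {
    	struct ib_wc wc;

    	while (ib_poll_cq(cq, 1, &wc) == 1) {
    		/* Always let the MAD layer see the completion so RMPP and
    		 * request/response tracking keep working; as noted, this
    		 * cannot be enforced while the user owns the CQ. */
    		ib_process_mad_wc(agent, &wc);
    	}
    }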