From vonwyl at EIG.UNIGE.CH Mon Nov 1 02:39:32 2004 From: vonwyl at EIG.UNIGE.CH (von Wyl) Date: Mon, 01 Nov 2004 11:39:32 +0100 Subject: [openib-general] Failed limiting maximum outstanding PCI reads Message-ID: <41861264.2020305@eig.unige.ch> Hi, I get this problem after installing the gen2 roland-merge stack (for linux kernel) and the gen1 trunk (for useraccess) : vapi: Inspecting PCI chipset: [ECHEC ] Failed limiting maximum outstanding PCI reads and when I try lspci -s 3 (port number 3...) I get : 04:03.0 PCI bridge: Mellanox Technology: Unknown device 5a46 (rev a1) If anyone has an idea, please e-mail me... From kcm at psc.edu Mon Nov 1 04:44:07 2004 From: kcm at psc.edu (Ken MacInnis) Date: Mon, 01 Nov 2004 07:44:07 -0500 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> Message-ID: <41862F97.3080301@psc.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Right. I've had this machine (and OS running a much more vanilla configuration) and HBA using the OpenIB and MTI stacks just fine in the past. Dual Opteron, 8GB RAM, PCI-X MT23108. This same problem happens with this kernel on fairly different hardware we're using too, though.. It is Fedora Core 1, vanilla 2.4.24-based with Lustre 1.2.6 patches/mods. Almost nothing is modular in the kernel.. it is either off or compiled in. In fact, ACPI is turned off.. perhaps enabling it would be beneficial? I have attached the config file if that helps. Perhaps there is something critical I have unknowingly disabled. Also, another question I have is fairly naive -- at what point are the Lion Cub (PCI Express) cards supported in the OpenIB stack? I seem to remember the Tavor code supporting them inherently but in a non-efficient manner if native code wasn't used. Ken Tziporet Koren wrote: | The problem is that the driver does not get the interrupt for the command | completion, | and thus you get the error: "Command not completed after timeout". | | It is related to the OS & system you are using. What is the distribution you | are using? We once saw such problems with older versions of SuSE. | | Try to add append="acpi=off" to the lilo you are using or add also | disableapic in the same append line. | | | Tziporet | | | -----Original Message----- | From: Ken MacInnis [mailto:kcm at psc.edu] | Sent: Sunday, October 31, 2004 8:20 PM | To: openib-general at openib.org | Subject: [openib-general] Problem with 2.4.24 and gen1 | I've got a fairly modified kernel here I'm trying to get a OpenIB stack | running on. It's a vanilla 2.4.24 kernel with Lustre and other patches | in it, but I'm seeing this when I modprobe ib_tavor: | | Oct 31 13:13:05 samwise kernel: THH(1): cmdif.c[1190]: Command not | completed after timeout: cmd=TAV | OR_IF_CMD_MAD_IFC (0x24), token=0x1400, pid=0x8E1, go=0 | Oct 31 13:13:05 samwise kernel: THH(1): CMD ERROR DUMP. opcode=0x24, | opc_mod = 0x1, exec_time_micro | =300000000 | . | . 
| Oct 31 13:13:06 samwise kernel: THH(1): cmdif.c[842]: Failed command | 0x24 (TAVOR_IF_CMD_MAD_IFC): s | tatus=0x103 (0x0103 - unexpected error - fatal) | Oct 31 13:13:06 samwise kernel: | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[2790]: | THH_hob_query_port_prop: cmdif returned FA | TAL | Oct 31 13:13:06 samwise kernel: VIPKL(1): qpm.c[278]: QPM_new: | HOBKL_query_port_prop returned with | error: -254 = VAPI_EFATAL | Oct 31 13:13:06 samwise kernel: VIPKL(1): qpm.c[302]: QPM_new: | returned with error: -254 = VAPI_EF | ATAL | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[3474]: | THH_hob_fatal_err_thread: RECEIVED FATAL E | RROR WAKEUP | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[4490]: | THH_hob_halt_hca: HALT HCA returned 0x103 | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[1620]: | THH_hob_destroy: FATAL ERROR | Oct 31 13:13:06 samwise kernel: THH(1): thh_hob.c[1627]: | THH_hob_destroy: PERFORMING SW RESET. pa=0 | xFE9F0010 va=0xF8A01010 | Oct 31 13:13:06 samwise kernel: | Oct 31 13:13:06 samwise kernel: Mellanox Tavor Device Driver is creating | device "InfiniHost0" (bus=0 | 4, devfn=00) | Oct 31 13:13:06 samwise kernel: | Oct 31 13:13:06 samwise kernel: | [KERNEL_IB][_tsIbTavorInitOne][tavor_main.c:86]InfiniHost0: VAPI_ope | n_hca failed, status -254 (Fatal error (Local Catastrophic Error)) | Oct 31 13:13:06 samwise kernel: | [SRPTP][srp_host_init][srp_host.c:1495]SRP Host using indirect addre | ssing | | | This occurs with an older openib rev (200-ish) as well as one up-to-date | as of today. | | Everything else (modules.conf, etc.) is set up as it has been when I was | messing with 2.4 kernels and OpenIB a few months ago, so I'm not | thinking it's related to such. | | Any ideas? Yes, I know it's 2.4 as well as a fairly older 2.4, but I | have no choice here. :) lspci -vvv bits follow. | | 03:01.0 PCI bridge: Mellanox Technology: Unknown device 5a46 (rev a1) | (prog-if 00 [Normal decode]) | Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- | ParErr- Stepping- SERR+ FastB2B- | Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- | SERR-

Reset- FastB2B- | Capabilities: [70] PCI-X non-bridge device. | Command: DPERE+ ERO+ RBC=0 OST=4 | Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, | DC=simple, DMMRBC=0, DMOST=0, D | MCRS=0, RSCEM- | 04:00.0 InfiniBand: Mellanox Technology: Unknown device 5a44 (rev a1) | Subsystem: Mellanox Technology: Unknown device 5a44 | Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- | ParErr- Stepping- SERR+ FastB2B- | Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- | SERR-
From halr at voltaire.com Mon Nov 1 07:23:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 10:23:33 -0500 Subject: [openib-general] [PATCH]spinlock shouldn't be held while calling ib_post_send() In-Reply-To: <20041029170917.3faa58e3.mshefty@ichips.intel.com>
References: <20041029170917.3faa58e3.mshefty@ichips.intel.com> Message-ID: <1099322613.12249.25.camel@hpc-1> On Fri, 2004-10-29 at 20:09, Sean Hefty wrote: > On Fri, 29 Oct 2004 18:06:47 -0600 > Shirley Ma wrote: > > > Here is the patch. > > Note that my patch removes the lock when calling ib_post_send. But, > holding the lock when calling ib_post_send() should be fine. Also, the > current completion code assumes that the work requests are queued in the > same order that the sends are posted in. Releasing the lock after > queuing the request, but before calling ib_post_send() allows work > requests to be posted out of order from the order that they are queued > on the send posted list. So should this patch be applied or is it superseded by your pending patch (and I should wait for that) ? Thanks. -- Hal From roland at topspin.com Mon Nov 1 07:27:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 07:27:27 -0800 Subject: [openib-general] [PATCH]spinlock shouldn't be held while calling ib_post_send() In-Reply-To: <1099322613.12249.25.camel@hpc-1> (Hal Rosenstock's message of "Mon, 01 Nov 2004 10:23:33 -0500") References: <20041029170917.3faa58e3.mshefty@ichips.intel.com> <1099322613.12249.25.camel@hpc-1> Message-ID: <52pt2xvbc0.fsf@topspin.com> Hal> So should this patch be applied or is it superseded by your Hal> pending patch (and I should wait for that) ? Sounds like the patch is not needed and actively breaks things, so my guess would be that it's better not to apply. - R. From roland at topspin.com Mon Nov 1 07:29:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 07:29:38 -0800 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <41862F97.3080301@psc.edu> (Ken MacInnis's message of "Mon, 01 Nov 2004 07:44:07 -0500") References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> <41862F97.3080301@psc.edu> Message-ID: <52lldlvb8d.fsf@topspin.com> Ken> Also, another question I have is fairly naive -- at what Ken> point are the Lion Cub (PCI Express) cards supported in the Ken> OpenIB stack? I seem to remember the Tavor code supporting Ken> them inherently but in a non-efficient manner if native code Ken> wasn't used. Lion Cub aka Arbel aka PCI Ex HCA is supported in Tavor compatibility mode right now by the mthca driver. I just received firmware and documentation for native mode last week, so support for that will be "coming soon." However Tavor mode is not really much of a performance hit -- on a suitable motherboard you should still be able to hit (bus limited) 20 Gb/sec of throughput. - Roland From roland at topspin.com Mon Nov 1 07:30:01 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 07:30:01 -0800 Subject: [openib-general] Failed limiting maximum outstanding PCI reads In-Reply-To: <41861264.2020305@eig.unige.ch> (von Wyl's message of "Mon, 01 Nov 2004 11:39:32 +0100") References: <41861264.2020305@eig.unige.ch> Message-ID: <52hdo9vb7q.fsf@topspin.com> von> Hi, I get this problem after installing the gen2 roland-merge von> stack (for linux kernel) and the gen1 trunk (for useraccess) gen1 userspace won't work with gen2 kernel side, unfortunately.
- Roland From kcm at psc.edu Mon Nov 1 07:48:57 2004 From: kcm at psc.edu (Ken MacInnis) Date: Mon, 01 Nov 2004 10:48:57 -0500 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <52lldlvb8d.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> <41862F97.3080301@psc.edu> <52lldlvb8d.fsf@topspin.com> Message-ID: <41865AE9.20808@psc.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Roland Dreier wrote: | Ken> Also, another question I have is fairly naive -- at what | Ken> point are the Lion Cub (PCI Express) cards supported in the | Ken> OpenIB stack? I seem to remember the Tavor code supporting | Ken> them inherently but in a non-efficient manner if native code | Ken> wasn't used. | | Lion Cub aka Arbel aka PCI Ex HCA is supported in Tavor compatibility | mode right now by the mthca driver. I just received firmware and | documentation for native mode last week, so support for that will be | "coming soon." However Tavor mode is not really much of a performance | hit -- on a suitable motherboard you should still be able to hit (bus | limited) 20 Gb/sec of throughput. Does this extend to the older non-mthca code? I assume this Tavor compatibility mode is wrt. the HBA, so that it does. :) Thanks! Ken - -- Ken MacInnis - Systems Engineer, PSC - http://www.psc.edu/~kcm/ kcm at psc dot edu - +1 412 268 9833 (w) - +1 412 268 5832 (f) Pittsburgh Supercomputing Center - 4400 Fifth Ave - Pittsburgh, PA 15213 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (MingW32) iD8DBQFBhlrpnT0C17PQhv4RAsUJAJ0drjCY0G6UDeztXJDPIHJqA8NUuQCfarLj xeIisjQe2XGV9GQ755KaU+c= =pe9I -----END PGP SIGNATURE----- From tziporet at mellanox.co.il Mon Nov 1 08:00:38 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 1 Nov 2004 18:00:38 +0200 Subject: [openib-general] Problem with 2.4.24 and gen1 Message-ID: <506C3D7B14CDD411A52C00025558DED6064BE96E@mtlex01.yok.mtl.com> Tavor mode and native Arbel mode have the same performance. The main change for the native mode is the ability to work without attached DDR and use the system memory instead. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Monday, November 01, 2004 5:30 PM To: Ken MacInnis Cc: Tziporet Koren; Tech_Support; openib-general at openib.org Subject: Re: [openib-general] Problem with 2.4.24 and gen1 Ken> Also, another question I have is fairly naive -- at what Ken> point are the Lion Cub (PCI Express) cards supported in the Ken> OpenIB stack? I seem to remember the Tavor code supporting Ken> them inherently but in a non-efficient manner if native code Ken> wasn't used. Lion Cub aka Arbel aka PCI Ex HCA is supported in Tavor compatibility mode right now by the mthca driver. I just received firmware and documentation for native mode last week, so support for that will be "coming soon." However Tavor mode is not really much of a performance hit -- on a suitable motherboard you should still be able to hit (bus limited) 20 Gb/sec of throughput. - Roland -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Mon Nov 1 08:17:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 11:17:38 -0500 Subject: [openib-general] [PATCH]code optimization in ib_register_mad_agent() In-Reply-To: <20041029171345.1d01e8a3.mshefty@ichips.intel.com> References: <20041029171345.1d01e8a3.mshefty@ichips.intel.com> Message-ID: <1099325858.3074.1.camel@hpc-1> On Fri, 2004-10-29 at 20:13, Sean Hefty wrote: > On Fri, 29 Oct 2004 17:35:40 -0600 > Shirley Ma wrote: > > > I am starting to look at the access layer code. Here is a code > > optimization patch in ib_register_mad_agent(). > > ib_mad_client_id must be incremented while holding the spinlock (or > converted into an atomic). The rest of the initialization looks fine > moved upwards. Thanks. Applied with moving the ib_mad_client_id increment down under holding the registration lock. -- Hal From halr at voltaire.com Mon Nov 1 08:24:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 11:24:35 -0500 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> Message-ID: <1099326275.3074.3.camel@hpc-1> On Fri, 2004-10-29 at 16:14, Sean Hefty wrote: > On Fri, 29 Oct 2004 13:01:03 -0700 (PDT) > Krishna Kumar wrote: > > > Hi, > > > > I know this changes the verbs interface a bit, but ... > > > > I don't see a value in the qp_cap being passed to different routines, > > when either ib_qp_attr or ib_qp_init_attr, both of which contain a > > qp_cap, are being passed at the same time. > > The parameter is there to separate input/output parameters, and resulted > from the original VAPI evolution of the code. There's no strong > technical reason that it cannot be removed. Should this patch be applied ? If so, I will test this and then it can also be merged to roland's branch. -- Hal From kcm at psc.edu Mon Nov 1 09:40:15 2004 From: kcm at psc.edu (Ken MacInnis) Date: Mon, 01 Nov 2004 12:40:15 -0500 Subject: [openib-general] Problem with 2.4.24 and gen1 In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6064BE95C@mtlex01.yok.mtl.com> Message-ID: <418674FF.7050209@psc.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 ACPI was already not in the kernel. Appending 'noapic disableapic' did work to load the Tavor code. :) Thanks for the hint! However, now OpenSM is still misbehaving: - ------------------------------------------------- OpenSM Rev:B1-rc1 Command Line Arguments: ~ Log File: /tmp/osm.log - ------------------------------------------------- Error from osm_opensm_init (1) Error from osm_opensm_bind (0x2A) [1099330621:000868906][4000] -> OpenSM Rev:B1-rc1 [1099330621:000868958][4000] -> osm_opensm_init: Forcing single threaded dispatcher. [1099330621:000869383][4000] -> osm_report_notice: Received Generic Notice type:3 num:66 from LID:0x 0000 GUID:0xfe80000000000000,0x0000000000000000 [1099330621:000869402][4000] -> osm_report_notice: Received Generic Notice type:3 num:66 from LID:0x 0000 GUID:0xfe80000000000000,0x0000000000000000 [1099330621:000869445][4000] -> __osm_vendor_get_ca_ids: ERR 3D09: No available channel adapters. [1099330621:000869456][4000] -> osm_vendor_get_all_port_attr: ERR 3D13: Fail to get CA Ids . [1099330621:000869484][4000] -> __osm_vendor_get_ca_ids: ERR 3D11: : Bad parameter in calling: EVAPI _list_hcas. 
[1099330621:000869493][4000] -> osm_vendor_get_guid_ca_and_port: ERR 3D16: Fail to get CA Ids . [1099330621:000869503][4000] -> osm_vendor_bind: ERR 5005: Fail to find port number of port guid:0x0 000000000000000 [1099330621:000869515][4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. [1099330621:000869526][4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). Any ideas on this? I did make very sure to check that userland and opensm was in sync with the kernel bits I'm using. The 0s in the LID and GUID are concerning me. I may end up trying the newer OpenIB stack for fun (ha), and see if that works better. Ken Tziporet Koren wrote: | Hi, | | The problem is that the driver does not get the interrupt for the command | completion, | and thus you get the error: "Command not completed after timeout". | | It is related to the OS & system you are using. What is the distribution you | are using? We once saw such problems with older versions of SuSE. | | Try to add append="acpi=off" to the lilo you are using or add also | disableapic in the same append line. | -----Original Message----- | From: Ken MacInnis [mailto:kcm at psc.edu] | Sent: Sunday, October 31, 2004 8:20 PM | To: openib-general at openib.org | Subject: [openib-general] Problem with 2.4.24 and gen1 | I've got a fairly modified kernel here I'm trying to get a OpenIB stack | running on. It's a vanilla 2.4.24 kernel with Lustre and other patches | in it, but I'm seeing this when I modprobe ib_tavor: | | Oct 31 13:13:05 samwise kernel: THH(1): cmdif.c[1190]: Command not | completed after timeout: cmd=TAV - -- Ken MacInnis - Systems Engineer, PSC - http://www.psc.edu/~kcm/ kcm at psc dot edu - +1 412 268 9833 (w) - +1 412 268 5832 (f) Pittsburgh Supercomputing Center - 4400 Fifth Ave - Pittsburgh, PA 15213 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (MingW32) iD8DBQFBhnT/nT0C17PQhv4RAicqAJ9hRiudNE1Bfof+BDrG09XfA5jD/wCcDH/D UT/E1V7i0yO6pPPOx9oobNQ= =R5wl -----END PGP SIGNATURE----- From mshefty at ichips.intel.com Mon Nov 1 09:40:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 09:40:03 -0800 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <1099326275.3074.3.camel@hpc-1> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> Message-ID: <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 11:24:35 -0500 Hal Rosenstock wrote: > On Fri, 2004-10-29 at 16:14, Sean Hefty wrote: > > On Fri, 29 Oct 2004 13:01:03 -0700 (PDT) > > Krishna Kumar wrote: > > > > > Hi, > > > > > > I know this changes the verbs interface a bit, but ... > > > > > > I don't see a value in the qp_cap being passed to different > > > routines, when either ib_qp_attr or ib_qp_init_attr, both of which > > > contain a qp_cap, are being passed at the same time. > > > > The parameter is there to separate input/output parameters, and > > resulted from the original VAPI evolution of the code. There's no > > strong technical reason that it cannot be removed. > > Should this patch be applied ? If so, I will test this and then it can > also be merged to roland's branch. I'm fine with applying this patch - just wanted to let others provide input. We should probably modify ipoib before committing the changes. 
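To make the resulting calling convention concrete, here is a minimal usage sketch (the cap values are illustrative and other init_attr fields are omitted; the point is that the capabilities now travel in and out through init_attr.cap rather than through a separate qp_cap argument, kernel context assumed):

	struct ib_qp_init_attr init_attr = {
		.cap = {
			.max_send_wr  = 64,	/* requested values */
			.max_recv_wr  = 64,
			.max_send_sge = 1,
			.max_recv_sge = 1,
		},
		.rq_sig_type = IB_SIGNAL_ALL_WR,
		.qp_type     = IB_QPT_UD
	};
	struct ib_qp *qp;

	qp = ib_create_qp(pd, &init_attr);
	if (IS_ERR(qp))
		return PTR_ERR(qp);
	/* on success the provider reports the actual capabilities back
	 * through init_attr.cap; in the mthca patch in this thread only
	 * max_inline_data is overwritten (to 0) */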
- Sean From mshefty at ichips.intel.com Mon Nov 1 09:41:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 09:41:46 -0800 Subject: [openib-general] [PATCH]spinlock shouldn't be held while calling ib_post_send() In-Reply-To: <52pt2xvbc0.fsf@topspin.com> References: <20041029170917.3faa58e3.mshefty@ichips.intel.com> <1099322613.12249.25.camel@hpc-1> <52pt2xvbc0.fsf@topspin.com> Message-ID: <20041101094146.59996de5.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 07:27:27 -0800 Roland Dreier wrote: > Hal> So should this patch be applied or is it superceeded by your > Hal> pending patch (and I should wait for that) ? > > sounds like the patch is not needed and actively breaks things, so my > guess would be that it's better not to apply. Correct - I would not apply this patch. - Sean From halr at voltaire.com Mon Nov 1 10:40:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 13:40:42 -0500 Subject: [openib-general] SM and smi Message-ID: <1099334442.3074.45.camel@hpc-1> The Get/Set(SMInfo) is one aspect related to the MAD layer for SM support which has been discussed on the list (and the changes are still on my TODO list). I was wondering how others saw SMI support for the SM. It seems to me that it makes sense to expose the routines that the agent is using so that they do not need to be reinvented for the SM. Does that make sense ? Better yet might be exposing a routine for SM class sending. Thanks. -- Hal From halr at voltaire.com Mon Nov 1 12:42:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 15:42:16 -0500 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> Message-ID: <1099341735.3074.97.camel@hpc-1> On Mon, 2004-11-01 at 12:40, Sean Hefty wrote: > On Mon, 01 Nov 2004 11:24:35 -0500 > Hal Rosenstock wrote: > > > On Fri, 2004-10-29 at 16:14, Sean Hefty wrote: > > > On Fri, 29 Oct 2004 13:01:03 -0700 (PDT) > > > Krishna Kumar wrote: > > > > > > > Hi, > > > > > > > > I know this changes the verbs interface a bit, but ... > > > > > > > > I don't see a value in the qp_cap being passed to different > > > > routines, when either ib_qp_attr or ib_qp_init_attr, both of which > > > > contain a qp_cap, are being passed at the same time. > > > > > > The parameter is there to separate input/output parameters, and > > > resulted from the original VAPI evolution of the code. There's no > > > strong technical reason that it cannot be removed. > > > > Should this patch be applied ? If so, I will test this and then it can > > also be merged to roland's branch. > > I'm fine with applying this patch - just wanted to let others provide > input. We should probably modify ipoib before committing the changes. Thanks. Applied (excepting the change to mthca_provider.c). Attached is the remaining patch for roland's branch. Note mad.c will need to be moved over as well. 
-- Hal Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 1106) +++ include/ib_verbs.h (working copy) @@ -709,12 +709,10 @@ struct ib_ah_attr *ah_attr); int (*destroy_ah)(struct ib_ah *ah); struct ib_qp * (*create_qp)(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr, - struct ib_qp_cap *qp_cap); + struct ib_qp_init_attr *qp_init_attr); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_cap *qp_cap); + int qp_attr_mask); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, @@ -851,13 +849,11 @@ int ib_destroy_ah(struct ib_ah *ah); struct ib_qp *ib_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr, - struct ib_qp_cap *qp_cap); + struct ib_qp_init_attr *qp_init_attr); int ib_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_cap *qp_cap); + int qp_attr_mask); int ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, Index: core/verbs.c =================================================================== --- core/verbs.c (revision 1106) +++ core/verbs.c (working copy) @@ -105,12 +105,11 @@ /* Queue pairs */ struct ib_qp *ib_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr, - struct ib_qp_cap *qp_cap) + struct ib_qp_init_attr *qp_init_attr) { struct ib_qp *qp; - qp = pd->device->create_qp(pd, qp_init_attr, qp_cap); + qp = pd->device->create_qp(pd, qp_init_attr); if (!IS_ERR(qp)) { qp->device = pd->device; @@ -133,10 +132,9 @@ int ib_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_cap *qp_cap) + int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, qp_cap); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); } EXPORT_SYMBOL(ib_modify_qp); Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 1106) +++ hw/mthca/mthca_provider.c (working copy) @@ -287,8 +287,7 @@ } static struct ib_qp *mthca_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *init_attr, - struct ib_qp_cap *qp_cap) + struct ib_qp_init_attr *init_attr) { struct mthca_qp *qp; int err; @@ -347,8 +346,7 @@ return ERR_PTR(err); } - *qp_cap = init_attr->cap; - qp_cap->max_inline_data = 0; + init_attr->cap.max_inline_data = 0; return &qp->ibqp; } Index: ulp/ipoib/ipoib_verbs.c =================================================================== --- ulp/ipoib/ipoib_verbs.c (revision 1106) +++ ulp/ipoib/ipoib_verbs.c (working copy) @@ -27,7 +27,6 @@ { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr *qp_attr; - struct ib_qp_cap qp_cap; int attr_mask; int ret; u16 pkey_index; @@ -47,7 +46,7 @@ /* set correct QKey for QP */ qp_attr->qkey = priv->qkey; attr_mask = IB_QP_QKEY; - ret = ib_modify_qp(priv->qp, qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); goto out; @@ -98,7 +97,6 @@ .rq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; - struct ib_qp_cap qp_cap; struct ib_qp_attr qp_attr; int attr_mask; @@ -115,7 +113,7 @@ } set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - priv->qp = ib_create_qp(priv->pd, &init_attr, &qp_cap); + priv->qp = ib_create_qp(priv->pd, &init_attr); if (IS_ERR(priv->qp)) { ipoib_warn(priv, "failed to create QP\n"); return PTR_ERR(priv->qp); @@ -137,7 +135,7 @@ IB_QP_PORT | IB_QP_PKEY_INDEX | IB_QP_STATE; - ret 
= ib_modify_qp(priv->qp, &qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); goto out_fail; @@ -146,7 +144,7 @@ qp_attr.qp_state = IB_QPS_RTR; /* Can't set this in a INIT->RTR transition */ attr_mask &= ~IB_QP_PORT; - ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); goto out_fail; @@ -156,7 +154,7 @@ qp_attr.sq_psn = 0; attr_mask |= IB_QP_SQ_PSN; attr_mask &= ~IB_QP_PKEY_INDEX; - ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask, &qp_cap); + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); goto out_fail; From roland at topspin.com Mon Nov 1 14:15:33 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 14:15:33 -0800 Subject: [openib-general] SM and smi In-Reply-To: <1099334442.3074.45.camel@hpc-1> (Hal Rosenstock's message of "Mon, 01 Nov 2004 13:40:42 -0500") References: <1099334442.3074.45.camel@hpc-1> Message-ID: <52lldltdve.fsf@topspin.com> Hal> I was wondering how others saw SMI support for the SM. It Hal> seems to me that it makes sense to expose the routines that Hal> the agent is using so that they do not need to be reinvented Hal> for the SM. Does that make sense ? Better yet might be Hal> exposing a routine for SM class sending. I think SMI processing should be applied to all DR SMPs passed to ib_post_send_mad(). This is what the Topspin stack does and I believe it is what OpenSM expects. - Roland From halr at voltaire.com Mon Nov 1 14:23:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 17:23:36 -0500 Subject: [openib-general] SM and smi In-Reply-To: <52lldltdve.fsf@topspin.com> References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> Message-ID: <1099347815.3270.5.camel@localhost.localdomain> On Mon, 2004-11-01 at 17:15, Roland Dreier wrote: > I think SMI processing should be applied to all DR SMPs passed to > ib_post_send_mad(). This is what the Topspin stack does and I believe > it is what OpenSM expects. That works for me. In doing that, the 0 hop outgoing case should call process_mad and return the response appropriately. Is the same thing true for DLID = local LID or does the driver or HCA handle this case as well ? -- Hal From roland at topspin.com Mon Nov 1 14:26:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 01 Nov 2004 14:26:53 -0800 Subject: [openib-general] SM and smi In-Reply-To: <1099347815.3270.5.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 01 Nov 2004 17:23:36 -0500") References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> <1099347815.3270.5.camel@localhost.localdomain> Message-ID: <52hdo9tdci.fsf@topspin.com> Hal> That works for me. In doing that, the 0 hop outgoing case Hal> should call process_mad and return the response Hal> appropriately. This behavior should probably be set by a flag. It's required for Tavor/Arbel, but we found on Anafa2 that performance was much better if we just posted zero-hop DR SMPs to the send queue. Hal> Is the same thing true for DLID = local LID or does the Hal> driver or HCA handle this case as well ? No, it's not required for LID-routed loopbacks. - R. 
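The zero-hop behavior discussed above, as a minimal self-contained sketch (all structure and function names here are illustrative stand-ins, not the actual openib smi.c or driver API; it captures the proposed per-device flag -- consume an outgoing zero-hop directed-route SMP locally via process_mad where the HCA requires it, otherwise just post it and let the hardware loop it back):

	#include <stdbool.h>

	#define MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81

	struct smp {
		unsigned char mgmt_class;
		unsigned char hop_cnt;	/* 0 == addressed to the local node */
	};

	struct hca {
		bool process_local_dr;	/* required for Tavor/Arbel, off for Anafa2 */
		int (*process_mad)(struct smp *in, struct smp *reply);
		int (*post_send)(struct smp *out);
	};

	static int send_smp(struct hca *hca, struct smp *smp, struct smp *reply)
	{
		if (hca->process_local_dr &&
		    smp->mgmt_class == MGMT_CLASS_SUBN_DIRECTED_ROUTE &&
		    smp->hop_cnt == 0)
			/* zero-hop DR SMP: consume locally, synthesize the reply */
			return hca->process_mad(smp, reply);

		/* otherwise post normally and let the HCA loop it back */
		return hca->post_send(smp);
	}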
From mashirle at us.ibm.com Mon Nov 1 15:01:26 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Mon, 1 Nov 2004 15:01:26 -0800 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() Message-ID: <200411011501.26812.mashirle@us.ibm.com> I am using my unix account to send the patch. Hope it works. diff -urN access/mad.c access.patch2/mad.c --- access/mad.c 2004-11-01 14:51:41.356902216 -0800 +++ access.patch2/mad.c 2004-11-01 14:53:37.003321288 -0800 @@ -368,16 +368,15 @@ struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_port_private *port_priv; - cur_send_wr = send_wr; /* Validate supplied parameters */ if (!mad_agent || !send_wr) { - *bad_send_wr = cur_send_wr; + *bad_send_wr = send_wr; return -EINVAL; } if (!mad_agent->send_handler || (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) { - *bad_send_wr = cur_send_wr; + *bad_send_wr = send_wr; return -EINVAL; } -- Thanks Shirley Ma IBM Linux Technology Center From mshefty at ichips.intel.com Mon Nov 1 15:06:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 15:06:20 -0800 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <200411011501.26812.mashirle@us.ibm.com> References: <200411011501.26812.mashirle@us.ibm.com> Message-ID: <20041101150620.686aad29.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 15:01:26 -0800 Shirley Ma wrote: > I am using my unix account to send the patch. Hope it works. > > diff -urN access/mad.c access.patch2/mad.c > --- access/mad.c 2004-11-01 14:51:41.356902216 -0800 > +++ access.patch2/mad.c 2004-11-01 14:53:37.003321288 -0800 > @@ -368,16 +368,15 @@ > struct ib_mad_agent_private *mad_agent_priv; > struct ib_mad_port_private *port_priv; > > - cur_send_wr = send_wr; > /* Validate supplied parameters */ > if (!mad_agent || !send_wr) { > - *bad_send_wr = cur_send_wr; > + *bad_send_wr = send_wr; > return -EINVAL; > } > > if (!mad_agent->send_handler || > (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) { > - *bad_send_wr = cur_send_wr; > + *bad_send_wr = send_wr; > return -EINVAL; > } Patch looks good to me, and should be applied. It raises an issue with the current code, though. There are checks for a valid mad_agent, valid_wr, but not a valid *bad_send_wr. I'm wondering if we should convert these checks to BUG_ON, or add in a check for a *bad_send_wr. As a minor optimization, we could make bad_send_wr optional for cases where only a single work request is being posted. - Sean From halr at voltaire.com Mon Nov 1 15:32:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 18:32:21 -0500 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <200411011501.26812.mashirle@us.ibm.com> References: <200411011501.26812.mashirle@us.ibm.com> Message-ID: <1099351941.3270.17.camel@localhost.localdomain> Thanks. Applied. From mashirle at us.ibm.com Mon Nov 1 15:37:02 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Mon, 1 Nov 2004 15:37:02 -0800 Subject: [openib-general] [PATCH]return the wrong bad_send_wr in ib_post_send_mad() Message-ID: <200411011537.02928.mashirle@us.ibm.com> Here is the patch to return the correct bad_send_wr value after calling ib_send_mad() in ib_post_send_mad().
diff -urN access/mad.c access.patch3/mad.c --- access/mad.c 2004-11-01 14:51:41.000000000 -0800 +++ access.patch3/mad.c 2004-11-01 15:31:05.013571784 -0800 @@ -389,7 +389,6 @@ cur_send_wr = send_wr; while (cur_send_wr) { unsigned long flags; - struct ib_send_wr *bad_wr; struct ib_mad_send_wr_private *mad_send_wr; next_send_wr = (struct ib_send_wr *)cur_send_wr->next; @@ -423,7 +422,7 @@ cur_send_wr->next = NULL; ret = ib_send_mad(mad_agent_priv, mad_send_wr, - cur_send_wr, &bad_wr); + cur_send_wr, bad_send_wr); if (ret) { /* Handle QP overrun separately... -ENOMEM */ @@ -432,7 +431,6 @@ list_del(&mad_send_wr->agent_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - *bad_send_wr = cur_send_wr; atomic_dec(&mad_agent_priv->refcount); return ret; } -- Thanks Shirley Ma IBM Linux Technology Center From halr at voltaire.com Mon Nov 1 15:39:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 18:39:59 -0500 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <20041101150620.686aad29.mshefty@ichips.intel.com> References: <200411011501.26812.mashirle@us.ibm.com> <20041101150620.686aad29.mshefty@ichips.intel.com> Message-ID: <1099352398.3270.25.camel@localhost.localdomain> On Mon, 2004-11-01 at 18:06, Sean Hefty wrote: > It raises an issue with the current code, though. There are checks for > a valid mad_agent, valid_wr, but not a valid *bad_send_wr. I'm > wondering if we should convert these checks to BUG_ON, or add in a check > for a *bad_send_wr. I don't think this is an "or". A check for *bad_send_wr should be added (which might be changed based on the below question). I will post a patch for this. IMO these should be BUG_ON but just errors as these are localized coding errors in some client. > As a minor optimization, we could make bad_send_wr > optional for cases where only a single work request is being posted. If *bad_send_wr is to be validated, the only time when NULL is allowed is when there is only one send_wr. Wouldn't this nullify any savings (unless the validation is removed) ? 
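For concreteness, the shape being discussed might look something like this (a self-contained sketch with simplified, made-up types, not the actual mad.c code; it validates the bad_send_wr pointer itself and tolerates NULL only for a single-request post):

	#include <errno.h>
	#include <stddef.h>

	struct send_wr {
		struct send_wr *next;	/* NULL when posting a single request */
	};

	static int post_send_sketch(struct send_wr *send_wr,
				    struct send_wr **bad_send_wr)
	{
		/* bad_send_wr may be omitted only for a lone request, since
		 * there is then no ambiguity about which request failed */
		if (!send_wr || (send_wr->next && !bad_send_wr)) {
			if (bad_send_wr)
				*bad_send_wr = send_wr;
			return -EINVAL;
		}

		/* ... queue and post each request here ... */

		return 0;
	}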
-- Hal From mashirle at us.ibm.com Mon Nov 1 15:47:47 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Mon, 1 Nov 2004 15:47:47 -0800 Subject: [openib-general] [PATCH]return the wrong bad_send_wr in ib_send_mad() In-Reply-To: <200411011537.02928.mashirle@us.ibm.com> References: <200411011537.02928.mashirle@us.ibm.com> Message-ID: <200411011547.47539.mashirle@us.ibm.com> Another patch to fix wrong bad_send_wr in ib_send_mad() diff -urN access/mad.c access.patch4/mad.c --- access/mad.c 2004-11-01 14:51:41.000000000 -0800 +++ access.patch4/mad.c 2004-11-01 15:44:08.173513376 -0800 @@ -347,10 +347,8 @@ list_add_tail(&mad_send_wr->send_list, &port_priv->send_posted_mad_list); port_priv->send_posted_mad_count++; - } else { + } else printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - *bad_send_wr = send_wr; - } spin_unlock_irqrestore(&port_priv->send_list_lock, flags); return ret; } -- Thanks Shirley Ma IBM Linux Technology Center From halr at voltaire.com Mon Nov 1 15:59:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 18:59:08 -0500 Subject: [openib-general] [PATCH] mad: Validate *bad_send_wr in ib_post_send_mad() Message-ID: <1099353548.2628.1.camel@hpc-1> mad: Validate *bad_send_wr in ib_post_send_mad() Index: mad.c =================================================================== --- mad.c (revision 1109) +++ mad.c (working copy) @@ -369,7 +369,7 @@ struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ - if (!mad_agent || !send_wr) { + if (!mad_agent || !send_wr || !*bad_send_wr) { *bad_send_wr = send_wr; return -EINVAL; } From mshefty at ichips.intel.com Mon Nov 1 15:52:41 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 15:52:41 -0800 Subject: [openib-general] [PATCH] mad: Validate *bad_send_wr in ib_post_send_mad() In-Reply-To: <1099353548.2628.1.camel@hpc-1> References: <1099353548.2628.1.camel@hpc-1> Message-ID: <20041101155241.4d058ef4.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 18:59:08 -0500 Hal Rosenstock wrote: > mad: Validate *bad_send_wr in ib_post_send_mad() > > Index: mad.c > =================================================================== > --- mad.c (revision 1109) > +++ mad.c (working copy) > @@ -369,7 +369,7 @@ > struct ib_mad_port_private *port_priv; > > /* Validate supplied parameters */ > - if (!mad_agent || !send_wr) { > + if (!mad_agent || !send_wr || !*bad_send_wr) { > *bad_send_wr = send_wr; We can't set bad_send_wr if it's invalid. - Sean From mshefty at ichips.intel.com Mon Nov 1 15:58:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 15:58:05 -0800 Subject: [openib-general] [PATCH]remove redundant assignment in ib_post_send_mad() In-Reply-To: <1099352398.3270.25.camel@localhost.localdomain> References: <200411011501.26812.mashirle@us.ibm.com> <20041101150620.686aad29.mshefty@ichips.intel.com> <1099352398.3270.25.camel@localhost.localdomain> Message-ID: <20041101155805.515aa53b.mshefty@ichips.intel.com> On Mon, 01 Nov 2004 18:39:59 -0500 Hal Rosenstock wrote: > I don't think this is an "or". A check for *bad_send_wr should be > added(which might be changed based on the below question). I will post > a patch for this. IMO these should be BUG_ON but just errors as these > are localized coding errors in some client. > > > As a minor optimization, we could make bad_send_wr > > optional for cases where only a single work request is being posted. 
> > If *bad_send_wr is to be validated, the only time when NULL is allowed > is when there is only one send_wr. Wouldn't this nullify any savings > (unless the validation is removed) ? Yes, I was thinking of the case where we removed the validation, but the savings is to make it easier on clients that always send a single MAD at a time. - Sean From krkumar at us.ibm.com Mon Nov 1 16:12:18 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 16:12:18 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad Message-ID: I believe the recent changes to catch all atomic_dec races with unregister failed to catch one spot in ib_post_send_mad. This routine increments mad_agent_priv->refcnt, and while unregister can run, if the ib_send_mad() fails, we drop the refcnt without checking if the refcnt has dropped to zero. The unregister would block indefinitely waiting to be woken up. I think the rest of the atomic_dec's looks good though. Patch included as attachment as well as inline ... Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-01 16:01:05.000000000 -0800 +++ 2/mad.c 2004-11-01 16:01:09.000000000 -0800 @@ -432,7 +432,8 @@ int ib_post_send_mad(struct ib_mad_agent spin_unlock_irqrestore(&mad_agent_priv->lock, flags); *bad_send_wr = cur_send_wr; - atomic_dec(&mad_agent_priv->refcount); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); return ret; } cur_send_wr= next_send_wr; -------------- next part -------------- diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-01 16:01:05.000000000 -0800 +++ 2/mad.c 2004-11-01 16:01:09.000000000 -0800 @@ -432,7 +432,8 @@ int ib_post_send_mad(struct ib_mad_agent spin_unlock_irqrestore(&mad_agent_priv->lock, flags); *bad_send_wr = cur_send_wr; - atomic_dec(&mad_agent_priv->refcount); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); return ret; } cur_send_wr= next_send_wr; From tduffy at sun.com Mon Nov 1 16:25:47 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 01 Nov 2004 16:25:47 -0800 Subject: [openib-general] [PATCH] Better IPoIB multicast handling In-Reply-To: <52wtxiea77.fsf@topspin.com> References: <528y9yhb5o.fsf@topspin.com> <1098477556.1127.9.camel@duffman> <52wtxiea77.fsf@topspin.com> Message-ID: <1099355147.9878.75.camel@duffman> On Fri, 2004-10-22 at 14:04 -0700, Roland Dreier wrote: > Can you try running with this debugging patch? (It should just crash sooner) So, I haven't been able to trigger the bug like I used to. I am not sure why, but after a series of fiasco's (Linux/sparc64 box rootfs corrupted, Solaris 10 server that I was running my SM on exposed a bug that forced me to upgrade to later build 70, the SM needed to be updated, and the firmware on my Tavor card was hosed in this process, leading me to have to reflash it in protected mode), it is all working fairly smoothly. I do get this warning now when I ifconfig up my device: ib0.8001: multicast group ff12401b8001000000000000ffffffff already attached But it seems harmless. I will keep trying to break it, but I seem to be doing a better job of making myself other work. -tduffy -- "A democracy cannot exist as a permanent form of government. It can only exist until the voters discover that they can vote themselves money from the public treasure. 
From that moment on, the majority always votes for the candidates promising the most money from the public treasury, with the result that democracy always collapses over loose fiscal policy followed by a dictatorship. The average of the world's greatest civilizations has been two hundred years. These nations have progressed through the following sequence: from bondage to spiritual faith, from spiritual faith to great courage, from courage to liberty, from liberty to abundance, from abundance to selfishness, from selfishness to complacency, from complacency to apathy, from apathy to dependency, from dependency back to bondage." -- Alexander Tyler, 1778 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Nov 1 16:25:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 16:25:44 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: Message-ID: <20041101162544.5af3f091.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 16:12:18 -0800 (PST) Krishna Kumar wrote: > I believe the recent changes to catch all atomic_dec races with > unregister failed to catch one spot in ib_post_send_mad. This routine > increments mad_agent_priv->refcnt, and while unregister can run, if > the ib_send_mad() fails, we drop the refcnt without checking if the > refcnt has dropped to zero. The unregister would block indefinitely > waiting to be woken up. I think the rest of the atomic_dec's looks > good though. I looked at this area of the code, and my thought was that we cannot handle a client that tries to send a MAD at the same time that they unregister. So, I think that a simple atomic_dec should be okay. If a client is calling unregister in a separate thread, then they are essentially trying to send a MAD after unregistering, in which case our data structures have been freed. - Sean > *bad_send_wr = cur_send_wr; > - atomic_dec(&mad_agent_priv->refcount); > + if (atomic_dec_and_test(&mad_agent_priv->refcount)) > + wake_up(&mad_agent_priv->wait); > return ret; > } From krkumar at us.ibm.com Mon Nov 1 16:38:03 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 16:38:03 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041101162544.5af3f091.mshefty@ichips.intel.com> Message-ID: Hi Sean, I think it is reasonable to have current senders racing with unregister. The unregister is waiting for all references to drop to zero before freeing up the resources. It killed the ones waiting for responses (mad_cancel), killed the ones who are executing in callback handlers, and finally after dropping the loader's module refcnt, it waits for the refcnt to drop to zero. These can only be threads which are actively receiving mad packets and those threads in the process of sending mad packets while the unregister was going on (and the ones which fail is the only cause of the problem). Essentially I think the unregister will hang and not free up the resource. Thanks, - KK On Mon, 1 Nov 2004, Sean Hefty wrote: > On Mon, 1 Nov 2004 16:12:18 -0800 (PST) > Krishna Kumar wrote: > > > I believe the recent changes to catch all atomic_dec races with > > unregister failed to catch one spot in ib_post_send_mad. 
This routine > > increments mad_agent_priv->refcnt, and while unregister can run, if > > the ib_send_mad() fails, we drop the refcnt without checking if the > > refcnt has dropped to zero. The unregister would block indefinitely > > waiting to be woken up. I think the rest of the atomic_dec's looks > > good though. > > I looked at this area of the code, and my thought was that we cannot > handle a client that tries to send a MAD at the same time that they > unregister. So, I think that a simple atomic_dec should be okay. If a > client is calling unregister in a separate thread, then they are > essentially trying to send a MAD after unregistering, in which case our > data structures have been freed. > > - Sean > > > > *bad_send_wr = cur_send_wr; > > - atomic_dec(&mad_agent_priv->refcount); > > + if (atomic_dec_and_test(&mad_agent_priv->refcount)) > > + wake_up(&mad_agent_priv->wait); > > return ret; > > } > > From mshefty at ichips.intel.com Mon Nov 1 16:59:04 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 1 Nov 2004 16:59:04 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: <20041101162544.5af3f091.mshefty@ichips.intel.com> Message-ID: <20041101165904.7a91f394.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 16:38:03 -0800 (PST) Krishna Kumar wrote: > Hi Sean, > > I think it is reasonable to have current senders racing with > unregister. The unregister is waiting for all references to drop to > zero before freeing up the resources. It killed the ones waiting for > responses(mad_cancel), killed the ones who are executing in callback > handlers, and finally after dropping the loader's module refcnt, it > waits for the refcnt to drop to zero. These can only be threads which > are actively receiving mad packets and those threads in the process of > sending mad packets while the unregister was going on (and the ones > which fail is the only cause of the problem). Essentially I think the > unregister will hang and not free up the resource. The difference here is that a client is calling into the API at the same time that they are trying to unregister. The code, even with this change, cannot handle this condition. For example, if the thread calling ib_unregister_mad_agent executes completely before the thread calling ib_post_send_mad runs (or can take a reference on the mad_agent), the mad_agent is no longer valid, and the structure will have been freed. The thread executing ib_post_send_mad can crash the system at this point. If we want to allow a client to call ib_unregister_mad_agent and ib_post_send_mad simultaneously, then ib_post_send_mad would need to perform some sort of lookup (likely in some global map) to validate the mad_agent. - Sean From krkumar at us.ibm.com Mon Nov 1 17:40:56 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 17:40:56 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041101165904.7a91f394.mshefty@ichips.intel.com> Message-ID: Hi Sean, I agree on the race between the threads, and this is something that I had considered as a separate problem (but now it comes back to haunt me :-). An easier solution for this problem is to make sure that whoever gets the agent (ib_mad_recv_done_handler) validate the mad_agent before calling us. Basically find_mad_agent can hold a refcnt on the agent. Is that correct ? If so, I can make a patch to handle races on that front. 
This code is pretty complicated, so please let me know if I have grossly mis-stated something (agents and agent_private, and whatnots :-). Thanks for your feedback, - KK On Mon, 1 Nov 2004, Sean Hefty wrote: > On Mon, 1 Nov 2004 16:38:03 -0800 (PST) > Krishna Kumar wrote: > > > Hi Sean, > > > > I think it is reasonable to have current senders racing with > > unregister. The unregister is waiting for all references to drop to > > zero before freeing up the resources. It killed the ones waiting for > > responses(mad_cancel), killed the ones who are executing in callback > > handlers, and finally after dropping the loader's module refcnt, it > > waits for the refcnt to drop to zero. These can only be threads which > > are actively receiving mad packets and those threads in the process of > > sending mad packets while the unregister was going on (and the ones > > which fail is the only cause of the problem). Essentially I think the > > unregister will hang and not free up the resource. > > The difference here is that a client is calling into the API at the same > time that they are trying to unregister. The code, even with this > change, cannot handle this condition. > > For example, if the thread calling ib_unregister_mad_agent executes > completely before the thread calling ib_post_send_mad runs (or can take > a reference on the mad_agent), the mad_agent is no longer valid, and the > structure will have been freed. The thread executing ib_post_send_mad > can crash the system at this point. > > If we want to allow a client to call ib_unregister_mad_agent and > ib_post_send_mad simultaneously, then ib_post_send_mad would need to > perform some sort of lookup (likely in some global map) to validate the > mad_agent. > > - Sean > > From krkumar at us.ibm.com Mon Nov 1 17:50:26 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 17:50:26 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: Message-ID: BTW, what I mean is that in the following code, put the atomic_inc() into the find() routine ... - KK ib_mad_recv_done_handler() { ... spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); if (!mad_agent) { spin_unlock_irqrestore(&port_priv->reg_lock, flags); printk(KERN_NOTICE PFX "No matching mad agent found for " "received MAD on port %d\n", port_priv->port_num); } else { atomic_inc(&mad_agent->refcount); spin_unlock_irqrestore(&port_priv->reg_lock, flags); ib_mad_complete_recv(mad_agent, recv, solicited); } ... } On Mon, 1 Nov 2004, Krishna Kumar wrote: > Hi Sean, > > I agree on the race between the threads, and this is something that I > had considered as a separate problem (but now it comes back to haunt > me :-). > > An easier solution for this problem is to make sure that whoever > gets the agent (ib_mad_recv_done_handler) validate the mad_agent > before calling us. Basically find_mad_agent can hold a refcnt > on the agent. Is that correct ? If so, I can make a patch to handle > races on that front. This code is pretty complicated, so please let > me know if I have grossly mis-stated something (agents and agent_private, > and whatnots :-). 
> > Thanks for your feedback, > > - KK > > On Mon, 1 Nov 2004, Sean Hefty wrote: > > > On Mon, 1 Nov 2004 16:38:03 -0800 (PST) > > Krishna Kumar wrote: > > > > > Hi Sean, > > > > > > I think it is reasonable to have current senders racing with > > > unregister. The unregister is waiting for all references to drop to > > > zero before freeing up the resources. It killed the ones waiting for > > > responses(mad_cancel), killed the ones who are executing in callback > > > handlers, and finally after dropping the loader's module refcnt, it > > > waits for the refcnt to drop to zero. These can only be threads which > > > are actively receiving mad packets and those threads in the process of > > > sending mad packets while the unregister was going on (and the ones > > > which fail is the only cause of the problem). Essentially I think the > > > unregister will hang and not free up the resource. > > > > The difference here is that a client is calling into the API at the same > > time that they are trying to unregister. The code, even with this > > change, cannot handle this condition. > > > > For example, if the thread calling ib_unregister_mad_agent executes > > completely before the thread calling ib_post_send_mad runs (or can take > > a reference on the mad_agent), the mad_agent is no longer valid, and the > > structure will have been freed. The thread executing ib_post_send_mad > > can crash the system at this point. > > > > If we want to allow a client to call ib_unregister_mad_agent and > > ib_post_send_mad simultaneously, then ib_post_send_mad would need to > > perform some sort of lookup (likely in some global map) to validate the > > mad_agent. > > > > - Sean > > > > > > From krkumar at us.ibm.com Mon Nov 1 18:02:54 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 18:02:54 -0800 (PST) Subject: [openib-general] [PATCH] Fix MMU if find_mad_agent() finds no agent. Message-ID: This fixes the above case and I also took the liberty of changing "goto ret" to "goto out", which just looks more aesthetic. I am not including inline, since my patches seem to get inlined automatically and without getting mangled. Hope this continues :-) Thanks, - KK -------------- next part -------------- diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-01 17:57:02.000000000 -0800 +++ 2/mad.c 2004-11-01 17:59:03.000000000 -0800 @@ -660,7 +660,7 @@ static void remove_mad_reg_req(struct ib /* Was MAD registration request supplied with original registration ? 
*/ if (!agent_priv->reg_req) { - goto ret; + goto out; } port_priv = agent_priv->port_priv; @@ -668,7 +668,7 @@ static void remove_mad_reg_req(struct ib if (!class) { printk(KERN_ERR PFX "No class table yet MAD registration " "request supplied\n"); - goto ret; + goto out; } mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); @@ -691,7 +691,7 @@ static void remove_mad_reg_req(struct ib } } -ret: +out: return; } @@ -753,7 +753,7 @@ find_mad_agent(struct ib_mad_port_privat if (!mad_agent) { printk(KERN_ERR PFX "No client 0x%x for received MAD " "on port %d\n", hi_tid, port_priv->port_num); - goto ret; + goto out; } } else { /* Routing is based on version, class, and method */ @@ -761,14 +761,14 @@ find_mad_agent(struct ib_mad_port_privat printk(KERN_ERR PFX "MAD received with unsupported " "class version %d on port %d\n", mad->mad_hdr.class_version, port_priv->port_num); - goto ret; + goto out; } version = port_priv->version[mad->mad_hdr.class_version]; if (!version) { printk(KERN_ERR PFX "MAD received on port %d for class " "version %d with no client\n", port_priv->port_num, mad->mad_hdr.class_version); - goto ret; + goto out; } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; @@ -776,18 +776,17 @@ find_mad_agent(struct ib_mad_port_privat printk(KERN_ERR PFX "MAD received on port %d for class " "%d with no client\n", port_priv->port_num, mad->mad_hdr.mgmt_class); - goto ret; + goto out; } mad_agent = class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; } -ret: - if (!mad_agent->agent.recv_handler) { +out: + if (mad_agent && !mad_agent->agent.recv_handler) { printk(KERN_ERR PFX "No receive handler for client " "%p on port %d\n", - &mad_agent->agent, - port_priv->port_num); + &mad_agent->agent, port_priv->port_num); mad_agent = NULL; } @@ -802,7 +801,7 @@ static int validate_mad(struct ib_mad *m if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { printk(KERN_ERR PFX "MAD received with unsupported base " "version %d\n", mad->mad_hdr.base_version); - goto ret; + goto out; } /* Filter SMI packets sent to other than QP0 */ @@ -816,7 +815,7 @@ static int validate_mad(struct ib_mad *m valid = 1; } -ret: +out: return valid; } @@ -978,7 +977,7 @@ static void ib_mad_recv_done_handler(str /* Validate MAD */ if (!validate_mad(recv->header.recv_buf.mad, qp_num)) - goto ret; + goto out; /* Snoop MAD ? */ if (port_priv->device->snoop_mad) @@ -986,7 +985,7 @@ static void ib_mad_recv_done_handler(str (u8)port_priv->port_num, wc->slid, recv->header.recv_buf.mad)) - goto ret; + goto out; spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ @@ -1003,7 +1002,7 @@ static void ib_mad_recv_done_handler(str ib_mad_complete_recv(mad_agent, recv, solicited); } -ret: +out: if (!mad_agent) { /* Should this case be optimized ? 
*/ kmem_cache_free(ib_mad_cache, recv); @@ -1255,7 +1254,7 @@ void ib_cancel_mad(struct ib_mad_agent * mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); if (!mad_send_wr) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - goto ret; + goto out; } if (mad_send_wr->status == IB_WC_SUCCESS) @@ -1264,7 +1263,7 @@ void ib_cancel_mad(struct ib_mad_agent * if (mad_send_wr->refcount != 0) { mad_send_wr->status = IB_WC_WR_FLUSH_ERR; spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - goto ret; + goto out; } list_del(&mad_send_wr->agent_list); @@ -1281,7 +1280,7 @@ void ib_cancel_mad(struct ib_mad_agent * if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); -ret: +out: return; } EXPORT_SYMBOL(ib_cancel_mad); From krkumar at us.ibm.com Mon Nov 1 18:10:16 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 1 Nov 2004 18:10:16 -0800 (PST) Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area Message-ID: This is a minor cleanup/optimize in this area. No need to hold lock for too long (for atomic_inc), no multiple unlocks and normal case of finding mad_agent first. Thanks, - KK -------------- next part -------------- --- mad.c.org 2004-11-01 17:41:09.000000000 -0800 +++ mad.c 2004-11-01 17:43:55.000000000 -0800 @@ -988,20 +988,19 @@ static void ib_mad_recv_done_handler(str recv->header.recv_buf.mad)) goto out; - spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ + spin_lock_irqsave(&port_priv->reg_lock, flags); solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); - if (!mad_agent) { - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); - } else { + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + if (mad_agent) { atomic_inc(&mad_agent->refcount); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); ib_mad_complete_recv(mad_agent, recv, solicited); - } + } else + printk(KERN_NOTICE PFX "No matching mad agent found for " + "received MAD on port %d\n", port_priv->port_num); out: if (!mad_agent) { From halr at voltaire.com Mon Nov 1 18:52:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 01 Nov 2004 21:52:53 -0500 Subject: [openib-general] mad: Validate *bad_send_wr in ib_post_send_mad() Message-ID: <1099363972.13695.2.camel@hpc-1> mad: Validate *bad_send_wr in ib_post_send_mad() Fix previous patch Index: mad.c =================================================================== --- mad.c (revision 1110) +++ mad.c (working copy) @@ -369,16 +369,15 @@ struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ - if (!mad_agent || !send_wr || !*bad_send_wr) { - *bad_send_wr = send_wr; - return -EINVAL; - } + if (!*bad_send_wr) + goto error1; + if (!mad_agent || !send_wr ) + goto error2; + if (!mad_agent->send_handler || - (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) { - *bad_send_wr = send_wr; - return -EINVAL; - } + (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) + goto error2; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); @@ -439,6 +438,11 @@ } return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return -EINVAL; } EXPORT_SYMBOL(ib_post_send_mad); From halr at voltaire.com Tue Nov 2 07:53:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 10:53:26 -0500 
Subject: [openib-general] [PATCH] mad: Fix previous patch again Message-ID: <1099410806.14725.5.camel@hpc-1> mad: Fix previous patch again (bad_send_wr validation in ib_post_send_mad) Index: mad.c =================================================================== --- mad.c (revision 1111) +++ mad.c (working copy) @@ -369,7 +369,7 @@ struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ - if (!*bad_send_wr) + if (!bad_send_wr) goto error1; if (!mad_agent || !send_wr ) From halr at voltaire.com Tue Nov 2 08:18:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:18:41 -0500 Subject: [openib-general] [PATCH] return the wrong bad_send_wr in ib_post_send_mad() In-Reply-To: <200411011537.02928.mashirle@us.ibm.com> References: <200411011537.02928.mashirle@us.ibm.com> Message-ID: <1099412321.3114.0.camel@hpc-1> On Mon, 2004-11-01 at 18:37, Shirley Ma wrote: > Here is the patch to return the correct bad_send_wr value after calling > ib_send_mad() in ib_post_send_mad(). Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 08:36:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:36:41 -0500 Subject: [openib-general] [PATCH] return the wrong bad_send_wr in ib_send_mad() In-Reply-To: <200411011547.47539.mashirle@us.ibm.com> References: <200411011537.02928.mashirle@us.ibm.com> <200411011547.47539.mashirle@us.ibm.com> Message-ID: <1099413401.3581.0.camel@hpc-1> On Mon, 2004-11-01 at 18:47, Shirley Ma wrote: > Another patch to fix wrong bad_send_wr in ib_send_mad() Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 08:49:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:49:52 -0500 Subject: [openib-general] [PATCH] Fix MMU if find_mad_agent() finds no agent. In-Reply-To: References: Message-ID: <1099414192.3581.8.camel@hpc-1> On Mon, 2004-11-01 at 21:02, Krishna Kumar wrote: > This fixes the above case and I also took the liberty of changing > "goto ret" to "goto out", which just looks more aesthetic. Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 08:59:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 11:59:18 -0500 Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area In-Reply-To: References: Message-ID: <1099414758.3581.15.camel@hpc-1> On Mon, 2004-11-01 at 21:10, Krishna Kumar wrote: > This is a minor cleanup/optimize in this area. No need to hold lock > for too long (for atomic_inc), no multiple unlocks and normal case > of finding mad_agent first. Doesn't this create a window between the unlocking after the mad_agent is found and the atomic_inc ? Couldn't a deregistration occur then ? -- Hal From halr at voltaire.com Tue Nov 2 09:11:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 12:11:46 -0500 Subject: [openib-general] ib_sa oops Message-ID: <1099415506.3581.27.camel@hpc-1> When I did a modprobe -r ib_ipoib, I got the following oops when the SA's send_handler is called while it's deregistering its MAD client with pending MADs.
I first bringup and configure IPoIB: /sbin/modprobe ib_ipoib /sbin/ifconfig ib0 192.168.0.20 I then do: ping -b 192.168.0.255 and ctl-C before it cycles around the list a second time and then: /sbin/modprobe -r ib_ipoib Segmentation fault /var/log/messages showed: Nov 2 10:54:17 hpc-1 kernel: Unable to handle kernel paging request at virtual address f8a50407 Nov 2 10:54:17 hpc-1 kernel: printing eip: Nov 2 10:54:17 hpc-1 kernel: f8a50407 Nov 2 10:54:17 hpc-1 kernel: *pde = 019a5067 Nov 2 10:54:17 hpc-1 kernel: *pte = 00000000 Nov 2 10:54:17 hpc-1 kernel: Oops: 0000 [#1] Nov 2 10:54:17 hpc-1 kernel: SMP Nov 2 10:54:17 hpc-1 kernel: Modules linked in: ib_sa ib_mad ib_services ib_mthca ib_core loop autofs e1000 ohci1394 ieee1394 parport_pc parport usbcore Nov 2 10:54:17 hpc-1 kernel: CPU: 0 Nov 2 10:54:17 hpc-1 kernel: EIP: 0060:[] Not tainted VLI Nov 2 10:54:17 hpc-1 kernel: EFLAGS: 00010246 (2.6.9) Nov 2 10:54:17 hpc-1 kernel: EIP is at 0xf8a50407 Nov 2 10:54:17 hpc-1 kernel: eax: e2f05280 ebx: 00000286 ecx: 00000000 edx: fffffffb Nov 2 10:54:17 hpc-1 kernel: esi: c6ba3340 edi: c6ba3348 ebp: fffffffb esp: e6eebdfc Nov 2 10:54:17 hpc-1 kernel: ds: 007b es: 007b ss: 0068 Nov 2 10:54:17 hpc-1 kernel: Process modprobe (pid: 12680, threadinfo=e6eea000 task=f5f30230) Nov 2 10:54:17 hpc-1 kernel: Stack: f8a217d8 fffffffb 00000000 e2f05280 e6eebe60 c02a1e5e 00000000 f5f30230 Nov 2 10:54:17 hpc-1 kernel: c0117d96 00000000 00000000 00000003 c170b060 c6ff3a70 c6ff3830 c011685a Nov 2 10:54:17 hpc-1 kernel: f5f30230 e74b5800 f5f30230 00000000 e6eebe98 c02a1a92 c6ff3830 c170e4d0 Nov 2 10:54:17 hpc-1 kernel: Call Trace: Nov 2 10:54:17 hpc-1 kernel: [] ib_sa_mcmember_rec_callback+0x5a/0x7f [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] wait_for_completion+0xc4/0xcc Nov 2 10:54:17 hpc-1 kernel: [] default_wake_function+0x0/0x12 Nov 2 10:54:17 hpc-1 kernel: [] finish_task_switch+0x3a/0x83 Nov 2 10:54:17 hpc-1 kernel: [] schedule+0x326/0x62e Nov 2 10:54:17 hpc-1 kernel: [] send_handler+0xaa/0xbc [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] cancel_mads+0xe5/0x127 [ib_mad] Nov 2 10:54:17 hpc-1 kernel: [] ib_unregister_mad_agent+0x16/0x135 [ib_mad] Nov 2 10:54:17 hpc-1 kernel: [] default_wake_function+0x0/0x12 Nov 2 10:54:17 hpc-1 kernel: [] default_wake_function+0x0/0x12 Nov 2 10:54:17 hpc-1 kernel: [] ib_get_client_data+0x42/0x4e [ib_core] Nov 2 10:54:17 hpc-1 kernel: [] ib_sa_remove_one+0x44/0x7d [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] ib_unregister_client+0xee/0xf3 [ib_core] Nov 2 10:54:17 hpc-1 kernel: [] try_stop_module+0x37/0x3b Nov 2 10:54:17 hpc-1 kernel: [] __try_stop_module+0x0/0x41 Nov 2 10:54:17 hpc-1 kernel: [] ib_sa_cleanup+0xf/0x13 [ib_sa] Nov 2 10:54:17 hpc-1 kernel: [] sys_delete_module+0x16d/0x19b Nov 2 10:54:17 hpc-1 kernel: [] sys_munmap+0x51/0x76 Nov 2 10:54:17 hpc-1 kernel: [] sysenter_past_esp+0x52/0x71 Nov 2 10:54:17 hpc-1 kernel: Code: Bad EIP value. -- Hal From roland at topspin.com Tue Nov 2 09:15:31 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 09:15:31 -0800 Subject: [openib-general] ib_sa oops In-Reply-To: <1099415506.3581.27.camel@hpc-1> (Hal Rosenstock's message of "Tue, 02 Nov 2004 12:11:46 -0500") References: <1099415506.3581.27.camel@hpc-1> Message-ID: <52wtx4rx3g.fsf@topspin.com> Hal> When I did a modprobe -r ib_ipoib, I got the following oops Hal> when the SA's send_handler is called on it's deregistering Hal> it's MAD client with pending MADs. Can you reproduce it with a kernel with CONFIG_KALLSYMS turned on so that I can read the oops? 
Thanks, Roland From roland at topspin.com Tue Nov 2 09:16:34 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 09:16:34 -0800 Subject: [openib-general] ib_sa oops In-Reply-To: <52wtx4rx3g.fsf@topspin.com> (Roland Dreier's message of "Tue, 02 Nov 2004 09:15:31 -0800") References: <1099415506.3581.27.camel@hpc-1> <52wtx4rx3g.fsf@topspin.com> Message-ID: <52sm7srx1p.fsf@topspin.com> Roland> Can you reproduce it with a kernel with CONFIG_KALLSYMS Roland> turned on so that I can read the oops? Sorry, never mind... the line wrapping was so bad that I didn't notice the function names. I'll take a look at what's happening. Thanks, Roland From halr at voltaire.com Tue Nov 2 09:26:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 12:26:53 -0500 Subject: [openib-general] ib_sa oops In-Reply-To: <52wtx4rx3g.fsf@topspin.com> References: <1099415506.3581.27.camel@hpc-1> <52wtx4rx3g.fsf@topspin.com> Message-ID: <1099416413.2985.0.camel@hpc-1> On Tue, 2004-11-02 at 12:15, Roland Dreier wrote: > Hal> When I did a modprobe -r ib_ipoib, I got the following oops > Hal> when the SA's send_handler is called on it's deregistering > Hal> it's MAD client with pending MADs. > > Can you reproduce it with a kernel with CONFIG_KALLSYMS turned on so > that I can read the oops? CONFIG_KALLSYMS is y. CONFIG_KALLSYMS_ALL is not set nor is CONFIG_KALLSYMS_EXTRA_PASS. Should either or both of them be set ? -- Hal From mshefty at ichips.intel.com Tue Nov 2 09:41:08 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:41:08 -0800 Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area In-Reply-To: <1099414758.3581.15.camel@hpc-1> References: <1099414758.3581.15.camel@hpc-1> Message-ID: <20041102094108.7f85249f.mshefty@ichips.intel.com> On Tue, 02 Nov 2004 11:59:18 -0500 Hal Rosenstock wrote: > On Mon, 2004-11-01 at 21:10, Krishna Kumar wrote: > > This is a minor cleanup/optimize in this area. No need to hold lock > > for too long (for atomic_inc), no multiple unlocks and normal case > > of finding mad_agent first. > > Doesn't this create a window between the unlocking after the mad_agent > is found and the atomic_inc ? Couldn't a deregistration occur then ? You are correct, Hal. We need to find and increment the mad_agent under the lock in order to prevent deregistration from completing. - Sean From roland at topspin.com Tue Nov 2 09:45:57 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 09:45:57 -0800 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <1099341735.3074.97.camel@hpc-1> (Hal Rosenstock's message of "Mon, 01 Nov 2004 15:42:16 -0500") References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> <1099341735.3074.97.camel@hpc-1> Message-ID: <52oeigrvoq.fsf@topspin.com> As far as I can tell this patch is broken: it removes the qp_cap parameter to modify_qp but doesn't fix up the mthca functions. I added the missing pieces by hand and applied. - R. 
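Returning to the find_mad_agent() race above, here is a minimal sketch of the receive-path pattern the thread converges on. It is illustrative only: it reuses the names from ib_mad_recv_done_handler (port_priv, recv, solicited, flags) and assumes, per the deregistration behavior described in this thread, that ib_unregister_mad_agent() may free the agent once its refcount is dropped and that ib_mad_complete_recv() releases the reference it is handed.

	spin_lock_irqsave(&port_priv->reg_lock, flags);
	mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad,
				   solicited);
	if (mad_agent)
		/* Pin the agent while reg_lock is still held, so a
		 * concurrent ib_unregister_mad_agent() cannot free it
		 * in the window between the unlock and the atomic_inc. */
		atomic_inc(&mad_agent->refcount);
	spin_unlock_irqrestore(&port_priv->reg_lock, flags);

	if (mad_agent)
		/* Assumed to drop the reference taken above when done. */
		ib_mad_complete_recv(mad_agent, recv, solicited);
	else
		printk(KERN_NOTICE PFX "No matching mad agent found for "
		       "received MAD on port %d\n", port_priv->port_num);

Taking the reference inside the locked region is what closes the window Hal points out: with the increment outside the lock, a deregistration could complete and free the agent before the receive path pins it.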
From mshefty at ichips.intel.com Tue Nov 2 09:50:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:50:51 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: <20041101165904.7a91f394.mshefty@ichips.intel.com> Message-ID: <20041102095051.39cdc978.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 17:40:56 -0800 (PST) Krishna Kumar wrote: > I agree on the race between the threads, and this is something that I > had considered as a separate problem (but now it comes back to haunt > me :-). This sort of race condition was something that I gave careful attention to when updating the code. That doesn't mean that I didn't miss something, and I appreciate that you're willing to review these for correctness. > An easier solution for this problem is to make sure that whoever > gets the agent (ib_mad_recv_done_handler) validate the mad_agent > before calling us. Basically find_mad_agent can hold a refcnt > on the agent. Is that correct ? This is correct. After find_mad_agent is called, the code takes a reference on the mad_agent. I think this is in the portion of code from one of your patches. Moving the reference inside find_mad_agent is a minor restructuring of the code. If we do move the reference, I think it makes sense to move the locking inside that routine as well. From krkumar at us.ibm.com Tue Nov 2 09:46:15 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 09:46:15 -0800 (PST) Subject: [openib-general] [PATCH] ib_mad_recv_done_handler cleanup the locking area In-Reply-To: <1099414758.3581.15.camel@hpc-1> Message-ID: Hi Hal, Yes, you are right. I was talking about this case with Sean, but forgot it when I actually sent the patch. Please disregard it. thx, - KK On Tue, 2 Nov 2004, Hal Rosenstock wrote: > On Mon, 2004-11-01 at 21:10, Krishna Kumar wrote: > > This is a minor cleanup/optimize in this area. No need to hold lock > > for too long (for atomic_inc), no multiple unlocks and normal case > > of finding mad_agent first. > > Doesn't this create a window between the unlocking after the mad_agent > is found and the atomic_inc ? Couldn't a deregistration occur then ? > > -- Hal > > > From mshefty at ichips.intel.com Tue Nov 2 09:52:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:52:39 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: Message-ID: <20041102095239.2b64f0b5.mshefty@ichips.intel.com> On Mon, 1 Nov 2004 17:50:26 -0800 (PST) Krishna Kumar wrote: > spin_lock_irqsave(&port_priv->reg_lock, flags); > /* Determine corresponding MAD agent for incoming receive MAD > */ solicited = solicited_mad(recv->header.recv_buf.mad); > mad_agent = find_mad_agent(port_priv, > recv->header.recv_buf.mad, > solicited); > if (!mad_agent) { > spin_unlock_irqrestore(&port_priv->reg_lock, flags); > printk(KERN_NOTICE PFX "No matching mad agent found > for " > "received MAD on port %d\n", > port_priv->port_num); > } else { > atomic_inc(&mad_agent->refcount); > spin_unlock_irqrestore(&port_priv->reg_lock, flags); > ib_mad_complete_recv(mad_agent, recv, solicited); Related to this, the call to solicited_mad() doesn't need to be made while holding the lock. Moving this outside, we can push the locking inside find_mad_agent as well, if it makes more sense to do so.
- Sean From mshefty at ichips.intel.com Tue Nov 2 09:56:47 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 09:56:47 -0800 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041028233000.19879b59.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> Message-ID: <20041102095647.3b74fbc9.mshefty@ichips.intel.com> On Thu, 28 Oct 2004 23:30:00 -0700 Sean Hefty wrote: > Here's what I have to handle MAD completion handling. This patch > tries to fix the issue of matching a completion (successful or error) > with the corresponding work request. Some notes: As just an update, there were a couple of minor issues in this patch (minor to fix anyway...). I will post a new patch after merging in the latest changes to the code and retesting. - Sean From krkumar at us.ibm.com Tue Nov 2 09:59:14 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 09:59:14 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041102095051.39cdc978.mshefty@ichips.intel.com> Message-ID: Hi Sean, I think that is the best approach. And using this method, we can also avoid holding the lock if solicited is set. I will send a patch in a few minutes if this approach looks good. Thanks, - KK On Tue, 2 Nov 2004, Sean Hefty wrote: > Related to this, the call to solicited_mad() doesn't need to be made > while holding the lock. Moving this outside, we can push the locking > inside find_mad_agent as well, if it makes more sense to do so. From mshefty at ichips.intel.com Tue Nov 2 10:21:26 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 10:21:26 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: <20041102095051.39cdc978.mshefty@ichips.intel.com> Message-ID: <20041102102126.26746a63.mshefty@ichips.intel.com> On Tue, 2 Nov 2004 09:59:14 -0800 (PST) Krishna Kumar wrote: > Hi Sean, > > I think that is the best approach. And using this method, we can also > avoid holding the lock if solicited is set. I will send a patch in a > few minutes if this approach looks good. Sounds good. I think that you'll need to hold the lock even if solicited is set to handle the case where a response is received after the sender unregistered. - Sean From krkumar at us.ibm.com Tue Nov 2 10:40:04 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 10:40:04 -0800 (PST) Subject: [openib-general] [PATCH] Optimize check_class_table and method_table to return BOOL. Message-ID: The callers are just interested in knowing whether any methods or method tables are in use, not the actual use count. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-02 10:32:51.000000000 -0800 +++ 2/mad.c 2004-11-02 10:35:01.000000000 -0800 @@ -530,34 +530,30 @@ static int allocate_method_table(struct return 0; } +/* + * Check to see if there are any methods still in use. + */ static int check_method_table(struct ib_mad_mgmt_method_table *method) { - int i, j; + int i; - /* Check to see if there are any methods still in use */ - j = 0; - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) if (method->agent[i]) - j++; - } - return j; + return 1; + return 0; } +/* + * Check to see if there are any method tables for this class still in use.
+ */ static int check_class_table(struct ib_mad_mgmt_class_table *class) { - int i, j; + int i; - /* - * Check to see if there are any method tables for this class still - * in use - */ - j = 0; - for (i = 0; i < MAX_MGMT_CLASS; i++) { - if (class->method_table[i]) { - j++; - } - } - return j; + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; } static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, From halr at voltaire.com Tue Nov 2 10:57:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 13:57:59 -0500 Subject: [openib-general] [PATCH] Better IPoIB multicast handling In-Reply-To: <1099355147.9878.75.camel@duffman> References: <528y9yhb5o.fsf@topspin.com> <1098477556.1127.9.camel@duffman> <52wtxiea77.fsf@topspin.com> <1099355147.9878.75.camel@duffman> Message-ID: <1099421879.2834.3.camel@hpc-1> On Mon, 2004-11-01 at 19:25, Tom Duffy wrote: > I do get this warning now when I ifconfig up my device: > > ib0.8001: multicast group ff12401b8001000000000000ffffffff already attached > > But it seems harmless. I see it too and it appears to be benign. I see it along with a previous message: ib0: multicast join failed for ff12401bffff000000000000ffffffff, status -5 There are multiple Sets of MCMemberRecord with different component masks which are attempted for the broadcast group and the 224.0.0.1 group when the network interface is brought up with an IP address. -- Hal From halr at voltaire.com Tue Nov 2 11:15:40 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 14:15:40 -0500 Subject: [openib-general] [RFC] [PATCH] Remove redundant ib_qp_cap from 2 verb routines. In-Reply-To: <52oeigrvoq.fsf@topspin.com> References: <20041029131437.6f1d0cf6.mshefty@ichips.intel.com> <1099326275.3074.3.camel@hpc-1> <20041101094003.6c7bc3e0.mshefty@ichips.intel.com> <1099341735.3074.97.camel@hpc-1> <52oeigrvoq.fsf@topspin.com> Message-ID: <1099422940.2838.2.camel@hpc-1> On Tue, 2004-11-02 at 12:45, Roland Dreier wrote: > As far as I can tell this patch is broken: it removes the qp_cap > parameter to modify_qp but doesn't fix up the mthca functions. Sorry :-( It looks like I somehow missed including either the modify_qp change or the change to mthca_qp.c. If it is the former, I wonder how things could build. Do you think this could be related to the oops ? (I will retest and see if I can recreate it). > I added the missing pieces by hand and applied. Thanks. -- Hal From mshefty at ichips.intel.com Tue Nov 2 11:19:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 11:19:06 -0800 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041028233000.19879b59.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> Message-ID: <20041102111906.431f78b0.mshefty@ichips.intel.com> On Thu, 28 Oct 2004 23:30:00 -0700 Sean Hefty wrote: > Here's what I have to handle MAD completion handling. This patch > tries to fix the issue of matching a completion (successful or error) > with the corresponding work request. Some notes: Please use this patch instead. I merged with the latest changes (as of this morning) and tested with opensm running on a remote node and ipoib running locally. This change is for the openib-candidate branch, but going forward, my intention is to create patches for the roland-merge branch.
- Sean Index: access/mad.c =================================================================== --- access/mad.c (revision 1116) +++ access/mad.c (working copy) @@ -81,9 +81,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp); -static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); @@ -130,6 +129,19 @@ 0 : mgmt_class; } +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -148,12 +160,13 @@ struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; - int ret2; + int ret2, qpn; unsigned long flags; u8 mgmt_class; /* Validate parameters */ - if (qp_type != IB_QPT_GSI && qp_type != IB_QPT_SMI) { + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { ret = ERR_PTR(-EINVAL); goto error1; } @@ -225,14 +238,14 @@ /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); - mad_agent_priv->port_priv = port_priv; + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; mad_agent_priv->agent.context = context; - mad_agent_priv->agent.qp = port_priv->qp[qp_type]; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; mad_agent_priv->agent.port_num = port_num; spin_lock_irqsave(&port_priv->reg_lock, flags); @@ -256,6 +269,7 @@ } } } + ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { ret = ERR_PTR(ret2); @@ -272,7 +286,6 @@ INIT_WORK(&mad_agent_priv->work, timeout_sends, mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); - mad_agent_priv->port_priv = port_priv; return &mad_agent_priv->agent; @@ -292,6 +305,7 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) { struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; unsigned long flags; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -305,13 +319,14 @@ */ cancel_mads(mad_agent_priv); + port_priv = mad_agent_priv->qp_info->port_priv; cancel_delayed_work(&mad_agent_priv->work); - flush_workqueue(mad_agent_priv->port_priv->wq); + flush_workqueue(port_priv->wq); - spin_lock_irqsave(&mad_agent_priv->port_priv->reg_lock, flags); + spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); list_del(&mad_agent_priv->agent_list); - spin_unlock_irqrestore(&mad_agent_priv->port_priv->reg_lock, flags); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* XXX: Cleanup pending RMPP receives for this agent */ @@ -326,30 +341,51 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); +static void queue_mad(struct ib_mad_queue *mad_queue, + struct ib_mad_list_head *mad_list) +{ + 
unsigned long flags; + + mad_list->mad_queue = mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_add_tail(&mad_list->list, &mad_queue->list); + mad_queue->count++; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { - struct ib_mad_port_private *port_priv; - unsigned long flags; + struct ib_mad_qp_info *qp_info; int ret; - port_priv = mad_agent_priv->port_priv; - /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; mad_send_wr->wr_id = send_wr->wr_id; - send_wr->wr_id = (unsigned long)mad_send_wr; + send_wr->wr_id = (unsigned long)&mad_send_wr->mad_list; + queue_mad(&qp_info->send_queue, &mad_send_wr->mad_list); - spin_lock_irqsave(&port_priv->send_list_lock, flags); ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); - if (!ret) { - list_add_tail(&mad_send_wr->send_list, - &port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count++; - } else + if (ret) { printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dequeue_mad(&mad_send_wr->mad_list); + *bad_send_wr = send_wr; + } return ret; } @@ -364,7 +400,6 @@ int ret; struct ib_send_wr *cur_send_wr, *next_send_wr; struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ if (!bad_send_wr) @@ -379,7 +414,6 @@ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - port_priv = mad_agent_priv->port_priv; /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; @@ -421,6 +455,7 @@ cur_send_wr, bad_send_wr); if (ret) { /* Handle QP overrun separately... -ENOMEM */ + /* Handle posting when QP is in error state... 
*/ /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); @@ -587,7 +622,7 @@ if (!mad_reg_req) return 0; - private = priv->port_priv; + private = priv->qp_info->port_priv; mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); class = &private->version[mad_reg_req->mgmt_class_version]; if (!*class) { @@ -663,7 +698,7 @@ goto out; } - port_priv = agent_priv->port_priv; + port_priv = agent_priv->qp_info->port_priv; class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; if (!class) { printk(KERN_ERR PFX "No class table yet MAD registration " @@ -695,20 +730,6 @@ return; } -static int convert_qpnum(u32 qp_num) -{ - /* - * XXX: No redirection currently - * QP0 and QP1 only - * Ultimately, will need table of QP numbers and table index - * as QP numbers will not be packed once redirection supported - */ - if (qp_num > 1) { - return -1; - } - return qp_num; -} - static int response_mad(struct ib_mad *mad) { /* Trap represses are responses although response bit is reset */ @@ -913,55 +934,21 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { + struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; - union ib_mad_recv_wrid wrid; - unsigned long flags; - u32 qp_num; + struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent = NULL; - int solicited, qpn; - - /* For receive, QP number is field in the WC WRID */ - wrid.wrid = wc->wr_id; - qp_num = wrid.wrid_field.qpn; - qpn = convert_qpnum(qp_num); - if (qpn == -1) { - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Packet received on unknown QPN %d\n", - qp_num); - return; - } - - /* - * Completion corresponds to first entry on - * posted MAD receive list based on WRID in completion - */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - if (!list_empty(&port_priv->recv_posted_mad_list[qpn])) { - rbuf = list_entry(port_priv->recv_posted_mad_list[qpn].next, - struct ib_mad_recv_buf, - list); - mad_priv_hdr = container_of(rbuf, struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, struct ib_mad_private, - header); - - /* Remove from posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - - } else { - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Receive completion WR ID 0x%Lx on QP %d " - "with no posted receive\n", - (unsigned long long) wc->wr_id, - qp_num); - return; - } - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + int solicited; + unsigned long flags; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); pci_unmap_single(port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - @@ -976,7 +963,7 @@ recv->header.recv_buf.grh = &recv->grh; /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_num)) + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; /* Snoop MAD ? 
*/ @@ -1009,7 +996,7 @@ } /* Post another receive request for this QP */ - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); + ib_mad_post_receive_mad(qp_info); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1030,7 +1017,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1060,7 +1048,7 @@ /* Reschedule a work item if we have a shorter timeout */ if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { cancel_delayed_work(&mad_agent_priv->work); - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1114,39 +1102,15 @@ struct ib_wc *wc) { struct ib_mad_send_wr_private *mad_send_wr; - unsigned long flags; - - /* Completion corresponds to first entry on posted MAD send list */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (list_empty(&port_priv->send_posted_mad_list)) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx but send " - "list is empty\n", (unsigned long long) wc->wr_id); - goto error; - } - - mad_send_wr = list_entry(port_priv->send_posted_mad_list.next, - struct ib_mad_send_wr_private, - send_list); - if (wc->wr_id != (unsigned long)mad_send_wr) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx doesn't match " - "posted send WR ID 0x%lx\n", - (unsigned long long) wc->wr_id, - (unsigned long)mad_send_wr); - goto error; - } - - /* Remove from posted send MAD list */ - list_del(&mad_send_wr->send_list); - port_priv->send_posted_mad_count--; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + struct ib_mad_list_head *mad_list; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + dequeue_mad(mad_list); /* Restore client wr_id in WC */ wc->wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); - return; - -error: - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); } /* @@ -1156,28 +1120,33 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { if (wc.status != IB_WC_SUCCESS) { - printk(KERN_ERR PFX "Completion error %d WRID 0x%Lx\n", - wc.status, (unsigned long long) wc.wr_id); + /* Determine if failure was a send or receive. 
*/ + mad_list = (struct ib_mad_list_head *) + (unsigned long)wc.wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->send_queue) + wc.opcode = IB_WC_SEND; + else + wc.opcode = IB_WC_RECV; + } + switch (wc.opcode) { + case IB_WC_SEND: ib_mad_send_done_handler(port_priv, &wc); - } else { - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - printk(KERN_ERR PFX "Wrong Opcode 0x%x on completion\n", - wc.opcode); - break; - } + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; } } } @@ -1307,7 +1276,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); break; } @@ -1332,24 +1302,13 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; - unsigned long flags; int ret; - union ib_mad_recv_wrid wrid; - int qpn; - - - qpn = convert_qpnum(qp->qp_num); - if (qpn == -1) { - printk(KERN_ERR PFX "Post receive to invalid QPN %d\n", qp->qp_num); - return -EINVAL; - } /* * Allocate memory for receive buffer. @@ -1367,47 +1326,32 @@ } /* Setup scatter list */ - sg_list.addr = pci_map_single(port_priv->device->dma_device, + sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; - sg_list.lkey = (*port_priv->mr).lkey; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; /* Setup receive WR */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - wrid.wrid_field.index = port_priv->recv_wr_index[qpn]++; - wrid.wrid_field.qpn = qp->qp_num; - recv_wr.wr_id = wrid.wrid; - - /* Link receive WR into posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_add_tail(&mad_priv->header.recv_buf.list, - &port_priv->recv_posted_mad_list[qpn]); - port_priv->recv_posted_mad_count[qpn]++; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); - /* Now, post receive WR */ - ret = ib_post_recv(qp, &recv_wr, &bad_recv_wr); + /* Post receive WR. 
*/ + queue_mad(&qp_info->recv_queue, &mad_priv->header.mad_list); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - - pci_unmap_single(port_priv->device->dma_device, + dequeue_mad(&mad_priv->header.mad_list); + pci_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); - /* Unlink from posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_del(&mad_priv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx failed ret = %d\n", (unsigned long long) recv_wr.wr_id, ret); @@ -1420,79 +1364,72 @@ /* * Allocate receive MADs and post receive WRs for them */ -static int ib_mad_post_receive_mads(struct ib_mad_port_private *port_priv) +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) { - int i, j; + int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - for (j = 0; j < IB_MAD_QPS_CORE; j++) { - if (ib_mad_post_receive_mad(port_priv, - port_priv->qp[j])) { - printk(KERN_ERR PFX "receive post %d failed " - "on %s port %d\n", i + 1, - port_priv->device->name, - port_priv->port_num); - } + ret = ib_mad_post_receive_mad(qp_info); + if (ret) { + printk(KERN_ERR PFX "receive post %d failed " + "on %s port %d\n", i + 1, + qp_info->port_priv->device->name, + qp_info->port_priv->port_num); + break; } } - - return 0; + return ret; } /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) { - int i; unsigned long flags; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - while (!list_empty(&port_priv->recv_posted_mad_list[i])) { + spin_lock_irqsave(&qp_info->recv_queue.lock, flags); + while (!list_empty(&qp_info->recv_queue.list)) { - rbuf = list_entry(port_priv->recv_posted_mad_list[i].next, - struct ib_mad_recv_buf, list); - mad_priv_hdr = container_of(rbuf, - struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, - struct ib_mad_private, header); + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); - /* Remove for posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - - /* Undo PCI mapping */ - pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); - - kmem_cache_free(ib_mad_cache, recv); - } + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); - port_priv->recv_posted_mad_count[i] = 0; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + /* Undo PCI mapping */ + pci_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + PCI_DMA_FROMDEVICE); + kmem_cache_free(ib_mad_cache, recv); } + 
+ qp_info->recv_queue.count = 0; + spin_unlock_irqrestore(&qp_info->recv_queue.lock, flags); } /* * Return all the posted send MADs */ -static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) { unsigned long flags; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Just clear port send posted MAD list */ - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count = 0; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + /* Just clear port send posted MAD list... revisit!!! */ + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + INIT_LIST_HEAD(&qp_info->send_queue.list); + qp_info->send_queue.count = 0; + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); } /* @@ -1618,35 +1555,21 @@ int ret, i, ret2; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "INIT\n", i); - return ret; + goto error; } - } - - ret = ib_mad_post_receive_mads(port_priv); - if (ret) { - printk(KERN_ERR PFX "Couldn't post receive requests\n"); - goto error; - } - - ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); - if (ret) { - printk(KERN_ERR PFX "Failed to request completion notification\n"); - goto error; - } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTR\n", i); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1654,17 +1577,31 @@ } } + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion notification\n"); + goto error; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive requests\n"); + goto error; + } + } return 0; + error: - ib_mad_return_posted_recv_mads(port_priv); for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret2 = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ret2 = ib_mad_change_qp_state_to_reset(port_priv-> + qp_info[i].qp); if (ret2) { printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " "change QP%d state to RESET\n", i); } } - return ret; } @@ -1676,16 +1613,64 @@ int i, ret; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_reset(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change " "%s port %d QP%d state to RESET\n", port_priv->device->name, port_priv->port_num, i); } + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); } +} - ib_mad_return_posted_recv_mads(port_priv); - ib_mad_return_posted_send_mads(port_priv); +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static int create_mad_qp(struct ib_mad_port_private 
*port_priv, + struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = port_priv->cq; + qp_init_attr.recv_cq = port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = port_priv->port_num; + qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); } /* @@ -1694,7 +1679,7 @@ */ static int ib_mad_port_open(struct ib_device *device, int port_num) { - int ret, cq_size, i; + int ret, cq_size; u64 iova = 0; struct ib_phys_buf buf_list = { .addr = 0, @@ -1749,38 +1734,15 @@ goto error5; } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - struct ib_qp_init_attr qp_init_attr; - - memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; - qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; - qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; - qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; - qp_init_attr.qp_type = i; /* Relies on ib_qp_type enum ordering of IB_QPT_SMI and IB_QPT_GSI */ - qp_init_attr.port_num = port_priv->port_num; - port_priv->qp[i] = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(port_priv->qp[i])) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", i); - ret = PTR_ERR(port_priv->qp[i]); - if (i == 0) - goto error6; - else - goto error7; - } - } + ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; spin_lock_init(&port_priv->reg_lock); - spin_lock_init(&port_priv->recv_list_lock); - spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->agent_list); - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - for (i = 0; i < IB_MAD_QPS_CORE; i++) - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->wq = create_workqueue("ib_mad"); if (!port_priv->wq) { @@ -1798,15 +1760,14 @@ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_mad_port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - return 0; error9: destroy_workqueue(port_priv->wq); error8: - ib_destroy_qp(port_priv->qp[1]); + destroy_mad_qp(&port_priv->qp_info[1]); error7: - ib_destroy_qp(port_priv->qp[0]); + destroy_mad_qp(&port_priv->qp_info[0]); error6: ib_dereg_mr(port_priv->mr); error5: @@ -1842,8 +1803,8 @@ ib_mad_port_stop(port_priv); flush_workqueue(port_priv->wq); destroy_workqueue(port_priv->wq); - ib_destroy_qp(port_priv->qp[1]); - ib_destroy_qp(port_priv->qp[0]); + 
destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); ib_destroy_cq(port_priv->cq); Index: access/mad_priv.h =================================================================== --- access/mad_priv.h (revision 1116) +++ access/mad_priv.h (working copy) @@ -79,16 +79,13 @@ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 - -union ib_mad_recv_wrid { - u64 wrid; - struct { - u32 index; - u32 qpn; - } wrid_field; +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; }; struct ib_mad_private_header { + struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; struct ib_mad_recv_buf recv_buf; DECLARE_PCI_UNMAP_ADDR(mapping) @@ -108,7 +105,7 @@ struct list_head agent_list; struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; - struct ib_mad_port_private *port_priv; + struct ib_mad_qp_info *qp_info; spinlock_t lock; struct list_head send_list; @@ -122,7 +119,7 @@ }; struct ib_mad_send_wr_private { - struct list_head send_list; + struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent *agent; u64 wr_id; /* client WR ID */ @@ -140,11 +137,25 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + /* struct ib_mad_queue overflow_queue; */ +}; + struct ib_mad_port_private { struct list_head port_list; struct ib_device *device; int port_num; - struct ib_qp *qp[IB_MAD_QPS_CORE]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; @@ -154,15 +165,7 @@ struct list_head agent_list; struct workqueue_struct *wq; struct work_struct work; - - spinlock_t send_list_lock; - struct list_head send_posted_mad_list; - int send_posted_mad_count; - - spinlock_t recv_list_lock; - struct list_head recv_posted_mad_list[IB_MAD_QPS_CORE]; - int recv_posted_mad_count[IB_MAD_QPS_CORE]; - u32 recv_wr_index[IB_MAD_QPS_CORE]; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; #endif /* __IB_MAD_PRIV_H__ */ From krkumar at us.ibm.com Tue Nov 2 11:17:25 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 11:17:25 -0800 (PST) Subject: [openib-general] [PATCH] General cleanup in add_mad_reg_req Message-ID: Optimize a "clear" operation to use memset, and re-arrange the code a bit to make the function cleaner. (Hal, I am sending patches based on latest bits, but not on top of my earlier sent-but-not-applied patches. If it doesn't apply, please let me know and I will recreate the patch). Thanks. 
- KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-02 10:46:19.000000000 -0800 +++ 2/mad.c 2004-11-02 10:58:50.000000000 -0800 @@ -596,31 +596,28 @@ static int add_mad_reg_req(struct ib_mad if (!*class) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; goto error1; } /* Clear management class table for this class version */ - for (i = 0; i < MAX_MGMT_CLASS; i++) { - (*class)->method_table[i] = NULL; - } + memset((*class)->method_table, 0, + sizeof((*class)->method_table)); /* Allocate method table for this management class */ method = &(*class)->method_table[mgmt_class]; - if (allocate_method_table(method)) { + if ((ret = allocate_method_table(method))) goto error2; - } } else { method = &(*class)->method_table[mgmt_class]; if (!*method) { /* Allocate method table for this management class */ - if (allocate_method_table(method)) { + if ((ret = allocate_method_table(method))) goto error1; - } } } /* Now, make sure methods are not already in use */ - if (method_in_use(method, mad_reg_req)) { + if (method_in_use(method, mad_reg_req)) goto error3; - } /* Finally, add in methods being registered */ for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); @@ -641,13 +638,11 @@ error3: *method = NULL; } ret = -EINVAL; - goto error; + goto error1; error2: kfree(*class); *class = NULL; error1: - ret = -ENOMEM; -error: return ret; } From halr at voltaire.com Tue Nov 2 11:34:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 14:34:25 -0500 Subject: [openib-general] [PATCH] Optimize check_class_table and method_table to return BOOL. In-Reply-To: References: Message-ID: <1099424065.4129.0.camel@hpc-1> On Tue, 2004-11-02 at 13:40, Krishna Kumar wrote: > The callers are just interested in knowing whether any methods or method > tables are in use, not the actual use count. Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 2 11:45:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 14:45:03 -0500 Subject: [openib-general] ifconfig ib0 down and then up vis a vis IP connectivity Message-ID: <1099424703.4129.9.camel@hpc-1> Hi, What is the ARP timeout in Linux ? If I down and then up the ib0 interface, there is some delay before connectivity is restored despite the fact that it is successfully (re)attached to the multicast groups and that all the QPNs seem to be the same. After some time period, connectivity is restored. Any idea on what is different ? It seems like it is an ARP cache issue. Thanks. -- Hal From halr at voltaire.com Tue Nov 2 12:00:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 15:00:04 -0500 Subject: [openib-general] [PATCH] General cleanup in add_mad_reg_req In-Reply-To: References: Message-ID: <1099425604.4129.26.camel@hpc-1> On Tue, 2004-11-02 at 14:17, Krishna Kumar wrote: > Optimize a "clear" operation to use memset, and re-arrange the code a bit > to make the function cleaner. Thanks. Applied. 
-- Hal From halr at voltaire.com Tue Nov 2 12:05:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 15:05:06 -0500 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041102111906.431f78b0.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> Message-ID: <1099425905.4129.29.camel@hpc-1> On Tue, 2004-11-02 at 14:19, Sean Hefty wrote: > Index: access/mad.c > =================================================================== > --- access/mad.c (revision 1116) > +++ access/mad.c (working copy) > @@ -81,9 +81,8 @@ > static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, > struct ib_mad_agent_private *priv); > static void remove_mad_reg_req(struct ib_mad_agent_private *priv); > -static int ib_mad_post_receive_mad(struct ib_mad_port_private > *port_priv, I get an error here: patching file mad.c patch: **** malformed patch at line 10: *port_priv, I think the mail somehow made *port_priv a separate line. -- Hal From mshefty at ichips.intel.com Tue Nov 2 12:02:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 2 Nov 2004 12:02:44 -0800 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <1099425905.4129.29.camel@hpc-1> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> <1099425905.4129.29.camel@hpc-1> Message-ID: <20041102120244.498d75f2.mshefty@ichips.intel.com> On Tue, 02 Nov 2004 15:05:06 -0500 Hal Rosenstock wrote: > I get an error here: > patching file mad.c > patch: **** malformed patch at line 10: *port_priv, I think my mailer wrapped the lines after I hit send. Let me try again. - Sean Index: access/mad.c =================================================================== --- access/mad.c (revision 1116) +++ access/mad.c (working copy) @@ -81,9 +81,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp); -static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); @@ -130,6 +129,19 @@ 0 : mgmt_class; } +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -148,12 +160,13 @@ struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; - int ret2; + int ret2, qpn; unsigned long flags; u8 mgmt_class; /* Validate parameters */ - if (qp_type != IB_QPT_GSI && qp_type != IB_QPT_SMI) { + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { ret = ERR_PTR(-EINVAL); goto error1; } @@ -225,14 +238,14 @@ /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); - mad_agent_priv->port_priv = port_priv; + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = 
rmpp_version; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; mad_agent_priv->agent.context = context; - mad_agent_priv->agent.qp = port_priv->qp[qp_type]; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; mad_agent_priv->agent.port_num = port_num; spin_lock_irqsave(&port_priv->reg_lock, flags); @@ -256,6 +269,7 @@ } } } + ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { ret = ERR_PTR(ret2); @@ -272,7 +286,6 @@ INIT_WORK(&mad_agent_priv->work, timeout_sends, mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); - mad_agent_priv->port_priv = port_priv; return &mad_agent_priv->agent; @@ -292,6 +305,7 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) { struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; unsigned long flags; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -305,13 +319,14 @@ */ cancel_mads(mad_agent_priv); + port_priv = mad_agent_priv->qp_info->port_priv; cancel_delayed_work(&mad_agent_priv->work); - flush_workqueue(mad_agent_priv->port_priv->wq); + flush_workqueue(port_priv->wq); - spin_lock_irqsave(&mad_agent_priv->port_priv->reg_lock, flags); + spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); list_del(&mad_agent_priv->agent_list); - spin_unlock_irqrestore(&mad_agent_priv->port_priv->reg_lock, flags); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* XXX: Cleanup pending RMPP receives for this agent */ @@ -326,30 +341,51 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); +static void queue_mad(struct ib_mad_queue *mad_queue, + struct ib_mad_list_head *mad_list) +{ + unsigned long flags; + + mad_list->mad_queue = mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_add_tail(&mad_list->list, &mad_queue->list); + mad_queue->count++; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { - struct ib_mad_port_private *port_priv; - unsigned long flags; + struct ib_mad_qp_info *qp_info; int ret; - port_priv = mad_agent_priv->port_priv; - /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; mad_send_wr->wr_id = send_wr->wr_id; - send_wr->wr_id = (unsigned long)mad_send_wr; + send_wr->wr_id = (unsigned long)&mad_send_wr->mad_list; + queue_mad(&qp_info->send_queue, &mad_send_wr->mad_list); - spin_lock_irqsave(&port_priv->send_list_lock, flags); ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); - if (!ret) { - list_add_tail(&mad_send_wr->send_list, - &port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count++; - } else + if (ret) { printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dequeue_mad(&mad_send_wr->mad_list); + *bad_send_wr = send_wr; + } return ret; } @@ -364,7 +400,6 @@ int ret; struct ib_send_wr *cur_send_wr, *next_send_wr; struct 
ib_mad_agent_private *mad_agent_priv; - struct ib_mad_port_private *port_priv; /* Validate supplied parameters */ if (!bad_send_wr) @@ -379,7 +414,6 @@ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - port_priv = mad_agent_priv->port_priv; /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; @@ -421,6 +455,7 @@ cur_send_wr, bad_send_wr); if (ret) { /* Handle QP overrun separately... -ENOMEM */ + /* Handle posting when QP is in error state... */ /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); @@ -587,7 +622,7 @@ if (!mad_reg_req) return 0; - private = priv->port_priv; + private = priv->qp_info->port_priv; mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); class = &private->version[mad_reg_req->mgmt_class_version]; if (!*class) { @@ -663,7 +698,7 @@ goto out; } - port_priv = agent_priv->port_priv; + port_priv = agent_priv->qp_info->port_priv; class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; if (!class) { printk(KERN_ERR PFX "No class table yet MAD registration " @@ -695,20 +730,6 @@ return; } -static int convert_qpnum(u32 qp_num) -{ - /* - * XXX: No redirection currently - * QP0 and QP1 only - * Ultimately, will need table of QP numbers and table index - * as QP numbers will not be packed once redirection supported - */ - if (qp_num > 1) { - return -1; - } - return qp_num; -} - static int response_mad(struct ib_mad *mad) { /* Trap represses are responses although response bit is reset */ @@ -913,55 +934,21 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { + struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; - union ib_mad_recv_wrid wrid; - unsigned long flags; - u32 qp_num; + struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent = NULL; - int solicited, qpn; - - /* For receive, QP number is field in the WC WRID */ - wrid.wrid = wc->wr_id; - qp_num = wrid.wrid_field.qpn; - qpn = convert_qpnum(qp_num); - if (qpn == -1) { - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Packet received on unknown QPN %d\n", - qp_num); - return; - } - - /* - * Completion corresponds to first entry on - * posted MAD receive list based on WRID in completion - */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - if (!list_empty(&port_priv->recv_posted_mad_list[qpn])) { - rbuf = list_entry(port_priv->recv_posted_mad_list[qpn].next, - struct ib_mad_recv_buf, - list); - mad_priv_hdr = container_of(rbuf, struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, struct ib_mad_private, - header); - - /* Remove from posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - - } else { - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - printk(KERN_ERR PFX "Receive completion WR ID 0x%Lx on QP %d " - "with no posted receive\n", - (unsigned long long) wc->wr_id, - qp_num); - return; - } - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + int solicited; + unsigned long flags; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); 
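+ /* wr_id carries a pointer to the embedded ib_mad_list_head; the two
+ * container_of() steps above recover the enclosing ib_mad_private, so
+ * no posted-receive-list search is needed to match this completion. */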
pci_unmap_single(port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - @@ -976,7 +963,7 @@ recv->header.recv_buf.grh = &recv->grh; /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_num)) + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; /* Snoop MAD ? */ @@ -1009,7 +996,7 @@ } /* Post another receive request for this QP */ - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); + ib_mad_post_receive_mad(qp_info); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1030,7 +1017,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1060,7 +1048,7 @@ /* Reschedule a work item if we have a shorter timeout */ if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { cancel_delayed_work(&mad_agent_priv->work); - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, &mad_agent_priv->work, delay); } } @@ -1114,39 +1102,15 @@ struct ib_wc *wc) { struct ib_mad_send_wr_private *mad_send_wr; - unsigned long flags; - - /* Completion corresponds to first entry on posted MAD send list */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (list_empty(&port_priv->send_posted_mad_list)) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx but send " - "list is empty\n", (unsigned long long) wc->wr_id); - goto error; - } - - mad_send_wr = list_entry(port_priv->send_posted_mad_list.next, - struct ib_mad_send_wr_private, - send_list); - if (wc->wr_id != (unsigned long)mad_send_wr) { - printk(KERN_ERR PFX "Send completion WR ID 0x%Lx doesn't match " - "posted send WR ID 0x%lx\n", - (unsigned long long) wc->wr_id, - (unsigned long)mad_send_wr); - goto error; - } - - /* Remove from posted send MAD list */ - list_del(&mad_send_wr->send_list); - port_priv->send_posted_mad_count--; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + struct ib_mad_list_head *mad_list; + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + dequeue_mad(mad_list); /* Restore client wr_id in WC */ wc->wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); - return; - -error: - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); } /* @@ -1156,28 +1120,33 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { if (wc.status != IB_WC_SUCCESS) { - printk(KERN_ERR PFX "Completion error %d WRID 0x%Lx\n", - wc.status, (unsigned long long) wc.wr_id); + /* Determine if failure was a send or receive. 
*/ + mad_list = (struct ib_mad_list_head *) + (unsigned long)wc.wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->send_queue) + wc.opcode = IB_WC_SEND; + else + wc.opcode = IB_WC_RECV; + } + switch (wc.opcode) { + case IB_WC_SEND: ib_mad_send_done_handler(port_priv, &wc); - } else { - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - printk(KERN_ERR PFX "Wrong Opcode 0x%x on completion\n", - wc.opcode); - break; - } + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; } } } @@ -1307,7 +1276,8 @@ delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->port_priv->wq, + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, &mad_agent_priv->work, delay); break; } @@ -1332,24 +1302,13 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; - unsigned long flags; int ret; - union ib_mad_recv_wrid wrid; - int qpn; - - - qpn = convert_qpnum(qp->qp_num); - if (qpn == -1) { - printk(KERN_ERR PFX "Post receive to invalid QPN %d\n", qp->qp_num); - return -EINVAL; - } /* * Allocate memory for receive buffer. @@ -1367,47 +1326,32 @@ } /* Setup scatter list */ - sg_list.addr = pci_map_single(port_priv->device->dma_device, + sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; - sg_list.lkey = (*port_priv->mr).lkey; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; /* Setup receive WR */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - wrid.wrid_field.index = port_priv->recv_wr_index[qpn]++; - wrid.wrid_field.qpn = qp->qp_num; - recv_wr.wr_id = wrid.wrid; - - /* Link receive WR into posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_add_tail(&mad_priv->header.recv_buf.list, - &port_priv->recv_posted_mad_list[qpn]); - port_priv->recv_posted_mad_count[qpn]++; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); - /* Now, post receive WR */ - ret = ib_post_recv(qp, &recv_wr, &bad_recv_wr); + /* Post receive WR. 
*/ + queue_mad(&qp_info->recv_queue, &mad_priv->header.mad_list); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - - pci_unmap_single(port_priv->device->dma_device, + dequeue_mad(&mad_priv->header.mad_list); + pci_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); - /* Unlink from posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_del(&mad_priv->header.recv_buf.list); - port_priv->recv_posted_mad_count[qpn]--; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx failed ret = %d\n", (unsigned long long) recv_wr.wr_id, ret); @@ -1420,79 +1364,72 @@ /* * Allocate receive MADs and post receive WRs for them */ -static int ib_mad_post_receive_mads(struct ib_mad_port_private *port_priv) +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) { - int i, j; + int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - for (j = 0; j < IB_MAD_QPS_CORE; j++) { - if (ib_mad_post_receive_mad(port_priv, - port_priv->qp[j])) { - printk(KERN_ERR PFX "receive post %d failed " - "on %s port %d\n", i + 1, - port_priv->device->name, - port_priv->port_num); - } + ret = ib_mad_post_receive_mad(qp_info); + if (ret) { + printk(KERN_ERR PFX "receive post %d failed " + "on %s port %d\n", i + 1, + qp_info->port_priv->device->name, + qp_info->port_priv->port_num); + break; } } - - return 0; + return ret; } /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) { - int i; unsigned long flags; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - while (!list_empty(&port_priv->recv_posted_mad_list[i])) { + spin_lock_irqsave(&qp_info->recv_queue.lock, flags); + while (!list_empty(&qp_info->recv_queue.list)) { - rbuf = list_entry(port_priv->recv_posted_mad_list[i].next, - struct ib_mad_recv_buf, list); - mad_priv_hdr = container_of(rbuf, - struct ib_mad_private_header, - recv_buf); - recv = container_of(mad_priv_hdr, - struct ib_mad_private, header); + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); - /* Remove for posted receive MAD list */ - list_del(&recv->header.recv_buf.list); - - /* Undo PCI mapping */ - pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); - - kmem_cache_free(ib_mad_cache, recv); - } + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); - port_priv->recv_posted_mad_count[i] = 0; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + /* Undo PCI mapping */ + pci_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + PCI_DMA_FROMDEVICE); + kmem_cache_free(ib_mad_cache, recv); } + 
+ qp_info->recv_queue.count = 0; + spin_unlock_irqrestore(&qp_info->recv_queue.lock, flags); } /* * Return all the posted send MADs */ -static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) { unsigned long flags; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Just clear port send posted MAD list */ - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count = 0; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + /* Just clear port send posted MAD list... revisit!!! */ + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + INIT_LIST_HEAD(&qp_info->send_queue.list); + qp_info->send_queue.count = 0; + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); } /* @@ -1618,35 +1555,21 @@ int ret, i, ret2; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "INIT\n", i); - return ret; + goto error; } - } - - ret = ib_mad_post_receive_mads(port_priv); - if (ret) { - printk(KERN_ERR PFX "Couldn't post receive requests\n"); - goto error; - } - - ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); - if (ret) { - printk(KERN_ERR PFX "Failed to request completion notification\n"); - goto error; - } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTR\n", i); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1654,17 +1577,31 @@ } } + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion notification\n"); + goto error; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive requests\n"); + goto error; + } + } return 0; + error: - ib_mad_return_posted_recv_mads(port_priv); for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret2 = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ret2 = ib_mad_change_qp_state_to_reset(port_priv-> + qp_info[i].qp); if (ret2) { printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " "change QP%d state to RESET\n", i); } } - return ret; } @@ -1676,16 +1613,64 @@ int i, ret; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); + ret = ib_mad_change_qp_state_to_reset(port_priv->qp_info[i].qp); if (ret) { printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change " "%s port %d QP%d state to RESET\n", port_priv->device->name, port_priv->port_num, i); } + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); } +} - ib_mad_return_posted_recv_mads(port_priv); - ib_mad_return_posted_send_mads(port_priv); +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static int create_mad_qp(struct ib_mad_port_private 
*port_priv, + struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = port_priv->cq; + qp_init_attr.recv_cq = port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = port_priv->port_num; + qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); } /* @@ -1694,7 +1679,7 @@ */ static int ib_mad_port_open(struct ib_device *device, int port_num) { - int ret, cq_size, i; + int ret, cq_size; u64 iova = 0; struct ib_phys_buf buf_list = { .addr = 0, @@ -1749,38 +1734,15 @@ goto error5; } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - struct ib_qp_init_attr qp_init_attr; - - memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; - qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; - qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; - qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; - qp_init_attr.qp_type = i; /* Relies on ib_qp_type enum ordering of IB_QPT_SMI and IB_QPT_GSI */ - qp_init_attr.port_num = port_priv->port_num; - port_priv->qp[i] = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(port_priv->qp[i])) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", i); - ret = PTR_ERR(port_priv->qp[i]); - if (i == 0) - goto error6; - else - goto error7; - } - } + ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; spin_lock_init(&port_priv->reg_lock); - spin_lock_init(&port_priv->recv_list_lock); - spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->agent_list); - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - for (i = 0; i < IB_MAD_QPS_CORE; i++) - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->wq = create_workqueue("ib_mad"); if (!port_priv->wq) { @@ -1798,15 +1760,14 @@ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_mad_port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - return 0; error9: destroy_workqueue(port_priv->wq); error8: - ib_destroy_qp(port_priv->qp[1]); + destroy_mad_qp(&port_priv->qp_info[1]); error7: - ib_destroy_qp(port_priv->qp[0]); + destroy_mad_qp(&port_priv->qp_info[0]); error6: ib_dereg_mr(port_priv->mr); error5: @@ -1842,8 +1803,8 @@ ib_mad_port_stop(port_priv); flush_workqueue(port_priv->wq); destroy_workqueue(port_priv->wq); - ib_destroy_qp(port_priv->qp[1]); - ib_destroy_qp(port_priv->qp[0]); + 
destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); ib_destroy_cq(port_priv->cq); Index: access/mad_priv.h =================================================================== --- access/mad_priv.h (revision 1116) +++ access/mad_priv.h (working copy) @@ -79,16 +79,13 @@ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 - -union ib_mad_recv_wrid { - u64 wrid; - struct { - u32 index; - u32 qpn; - } wrid_field; +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; }; struct ib_mad_private_header { + struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; struct ib_mad_recv_buf recv_buf; DECLARE_PCI_UNMAP_ADDR(mapping) @@ -108,7 +105,7 @@ struct list_head agent_list; struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; - struct ib_mad_port_private *port_priv; + struct ib_mad_qp_info *qp_info; spinlock_t lock; struct list_head send_list; @@ -122,7 +119,7 @@ }; struct ib_mad_send_wr_private { - struct list_head send_list; + struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent *agent; u64 wr_id; /* client WR ID */ @@ -140,11 +137,25 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + /* struct ib_mad_queue overflow_queue; */ +}; + struct ib_mad_port_private { struct list_head port_list; struct ib_device *device; int port_num; - struct ib_qp *qp[IB_MAD_QPS_CORE]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; @@ -154,15 +165,7 @@ struct list_head agent_list; struct workqueue_struct *wq; struct work_struct work; - - spinlock_t send_list_lock; - struct list_head send_posted_mad_list; - int send_posted_mad_count; - - spinlock_t recv_list_lock; - struct list_head recv_posted_mad_list[IB_MAD_QPS_CORE]; - int recv_posted_mad_count[IB_MAD_QPS_CORE]; - u32 recv_wr_index[IB_MAD_QPS_CORE]; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; #endif /* __IB_MAD_PRIV_H__ */ From krkumar at us.ibm.com Tue Nov 2 12:09:42 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 12:09:42 -0800 (PST) Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops Message-ID: I am not entirely sure that I understand the bitwise operator being used in the code. Following patch is assuming that I have got it right :-). 
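As a concrete illustration of the idiom in question, here is a minimal user-space sketch of the same set-bit walk. It is only an approximation under stated assumptions: find_next_set() and MASK_WORDS are invented stand-ins (the kernel's find_first_bit()/find_next_bit() have different signatures), while IB_MGMT_MAX_METHODS and the packing of the mask into unsigned longs mirror method_mask; none of this is the actual mad.c code.

#include <stdio.h>
#include <limits.h>

#define IB_MGMT_MAX_METHODS 128
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS (IB_MGMT_MAX_METHODS / BITS_PER_LONG)

/* Invented stand-in for the kernel's find_next_bit(): index of the first
 * set bit at or after 'start', or 'nbits' when no further bit is set. */
static unsigned int find_next_set(const unsigned long *mask,
                                  unsigned int nbits, unsigned int start)
{
        unsigned int i;

        for (i = start; i < nbits; ++i)
                if (mask[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                        return i;
        return nbits;
}

int main(void)
{
        unsigned long method_mask[MASK_WORDS] = { 0 };
        unsigned int i;

        /* Mark two methods as registered, e.g. 0x01 (Get) and 0x02 (Set). */
        method_mask[0x01 / BITS_PER_LONG] |= 1UL << (0x01 % BITS_PER_LONG);
        method_mask[0x02 / BITS_PER_LONG] |= 1UL << (0x02 % BITS_PER_LONG);

        /* Same loop shape as in the patch below: visit only the set bits
         * instead of scanning all 128 method slots. */
        for (i = find_next_set(method_mask, IB_MGMT_MAX_METHODS, 0);
             i < IB_MGMT_MAX_METHODS;
             i = find_next_set(method_mask, IB_MGMT_MAX_METHODS, i + 1))
                printf("method 0x%02x is registered\n", i);

        return 0;
}

Compiled stand-alone this prints exactly the two marked methods, which is the effect the loops in the patch below depend on.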
thanks, - KK diff -ruNp 5/mad.c 6/mad.c --- 5/mad.c 2004-11-02 12:07:51.000000000 -0800 +++ 6/mad.c 2004-11-02 12:08:32.000000000 -0800 @@ -537,9 +537,13 @@ static int check_method_table(struct ib_ { int i; - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) - if (method->agent[i]) - return 1; + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + /* if we entered the loop, we have found an agent bit set */ + return 1; + } return 0; } @@ -561,11 +565,13 @@ static void remove_methods_mad_agent(str { int i; - /* Remove any methods for this mad agent */ - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { - if (method->agent[i] == agent) { - method->agent[i] = NULL; - } + /* Remove all methods for this mad agent */ + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + BUG_ON(method->agent[i] != agent); + method->agent[i] = NULL; } } From halr at voltaire.com Tue Nov 2 14:06:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 17:06:30 -0500 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041102120244.498d75f2.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> <1099425905.4129.29.camel@hpc-1> <20041102120244.498d75f2.mshefty@ichips.intel.com> Message-ID: <1099433189.3266.0.camel@localhost.localdomain> On Tue, 2004-11-02 at 15:02, Sean Hefty wrote: > I think my mailer wrapped the lines after I hit send. Let me try again. That's better. Thanks. -- Hal From halr at voltaire.com Tue Nov 2 14:13:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 17:13:11 -0500 Subject: [openib-general] [PATCH] for review -- fix MAD completion handling In-Reply-To: <20041102120244.498d75f2.mshefty@ichips.intel.com> References: <20041028233000.19879b59.mshefty@ichips.intel.com> <20041102111906.431f78b0.mshefty@ichips.intel.com> <1099425905.4129.29.camel@hpc-1> <20041102120244.498d75f2.mshefty@ichips.intel.com> Message-ID: <1099433591.3266.4.camel@localhost.localdomain> On Tue, 2004-11-02 at 15:02, Sean Hefty wrote: > I think my mailer wrapped the lines after I hit send. Let me try again. Thanks! Applied. -- Hal From halr at voltaire.com Tue Nov 2 14:30:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 17:30:39 -0500 Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: References: Message-ID: <1099434639.3266.18.camel@localhost.localdomain> On Tue, 2004-11-02 at 15:09, Krishna Kumar wrote: > I am not entirely sure that I understand the bitwise operator being > used in the code. Following patch is assuming that I have got it > right :-). 
> > thanks, > > - KK > > diff -ruNp 5/mad.c 6/mad.c > --- 5/mad.c 2004-11-02 12:07:51.000000000 -0800 > +++ 6/mad.c 2004-11-02 12:08:32.000000000 -0800 > @@ -537,9 +537,13 @@ static int check_method_table(struct ib_ > { > int i; > > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) > - if (method->agent[i]) > - return 1; > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > + i < IB_MGMT_MAX_METHODS; > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > + 1+i)) { > + /* if we entered the loop, we have found an agent bit set */ > + return 1; > + } > return 0; > } This is no longer checking the method table. It is checking the registration request. Also, a pointer to the registration request would need to be passed into this routine if it is to be used. > @@ -561,11 +565,13 @@ static void remove_methods_mad_agent(str > { > int i; > > - /* Remove any methods for this mad agent */ > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { > - if (method->agent[i] == agent) { > - method->agent[i] = NULL; > - } > + /* Remove all methods for this mad agent */ > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > + i < IB_MGMT_MAX_METHODS; > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > + 1+i)) { > + BUG_ON(method->agent[i] != agent); > + method->agent[i] = NULL; > } > } Same compilation issue as above: A pointer to the registration request would need to be passed into this routine if it is to be used. -- Hal From krkumar at us.ibm.com Tue Nov 2 15:46:26 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 15:46:26 -0800 (PST) Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: <1099434639.3266.18.camel@localhost.localdomain> Message-ID: Hi Hal, I didn't fix the argument to this routine, I was trying to understand if the idea behind this will work, hence the RFC in the subject. Sorry for creating the confusion. I was trying to understand if what I was assuming is right or not. The first part of the patch checks if any bit is set in the method_mask, and if so, it means that a method (table?) is registered and hence it returns error. Actually there might be a better way to check if the bitmask is all-zero's and avoid the for loop, but I don't see any macros for that, and I didn't want to use "if (method_mask)". The add_mad_reg_req() code is doing : /* Finally, add in methods being registered */ for (i = find_first_bit(mad_reg_req->method_mask,IB_MGMT_MAX_METHODS); i < IB_MGMT_MAX_METHODS; i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { (*method)->agent[i] = priv; } So the agent[0-128] is pointing to the priv when that particular bitmask is set in the method_mask (exact same bit number is used as index in agent). Do you think this model is correct ? The Get/Set/Repress functions can be checked faster by checking if the first/next bit being set rather than going through the entire array of 128 agents, or in one case whether the bitmask is zero or non-zero instead of looping 128 times. If this is true, I will recreate the patch and compile before sending in the final patch. thx, - KK On Tue, 2 Nov 2004, Hal Rosenstock wrote: > On Tue, 2004-11-02 at 15:09, Krishna Kumar wrote: > > I am not entirely sure that I understand the bitwise operator being > > used in the code. Following patch is assuming that I have got it > > right :-). 
> > > > thanks, > > > > - KK > > > > diff -ruNp 5/mad.c 6/mad.c > > --- 5/mad.c 2004-11-02 12:07:51.000000000 -0800 > > +++ 6/mad.c 2004-11-02 12:08:32.000000000 -0800 > > @@ -537,9 +537,13 @@ static int check_method_table(struct ib_ > > { > > int i; > > > > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) > > - if (method->agent[i]) > > - return 1; > > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > > + i < IB_MGMT_MAX_METHODS; > > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > > + 1+i)) { > > + /* if we entered the loop, we have found an agent bit set */ > > + return 1; > > + } > > return 0; > > } > > This is no longer checking the method table. It is checking the > registration request. Also, a pointer to the registration request > would need to be passed into this routine if it is to be used. > > > @@ -561,11 +565,13 @@ static void remove_methods_mad_agent(str > > { > > int i; > > > > - /* Remove any methods for this mad agent */ > > - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { > > - if (method->agent[i] == agent) { > > - method->agent[i] = NULL; > > - } > > + /* Remove all methods for this mad agent */ > > + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); > > + i < IB_MGMT_MAX_METHODS; > > + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > > + 1+i)) { > > + BUG_ON(method->agent[i] != agent); > > + method->agent[i] = NULL; > > } > > } > > Same compilation issue as above: > A pointer to the registration request would need to be passed into this > routine if it is to be used. > > -- Hal > > > From iod00d at hp.com Tue Nov 2 16:10:15 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 2 Nov 2004 16:10:15 -0800 Subject: [openib-general] ib_modify_qp() too many arguments Message-ID: <20041103001015.GA13563@cup.hp.com> Roland, I am trying to build roland-merge #1119 on top of 2.6.10-rc1 for ia64. And yes, the usage noted below doesn't match the declaration: CC [M] drivers/infiniband/core/cm_main.o In file included from drivers/infiniband/core/cm_main.c:24: drivers/infiniband/core/cm_priv.h: In function `ib_cm_qp_modify': drivers/infiniband/core/cm_priv.h:183: error: too many arguments to function `ib_modify_qp' Should I not (yet) be enabling CONFIG_INFINIBAND_CM? Trivial patch appended to "fix". Though I don't know if it's "right" since it seems ib_cm_qp_modify() could just go away. thanks, grant Index: src/linux-kernel/infiniband/core/cm_priv.h =================================================================== --- src/linux-kernel/infiniband/core/cm_priv.h (revision 1116) +++ src/linux-kernel/infiniband/core/cm_priv.h (working copy) @@ -178,9 +178,7 @@ struct ib_qp_attr *attr, int attr_mask) { - struct ib_qp_cap qp_cap; - - return qp ? ib_modify_qp(qp, attr, attr_mask, &qp_cap) : 0; + return qp ? ib_modify_qp(qp, attr, attr_mask) : 0; } int ib_cm_timeout_to_jiffies(int timeout); From halr at voltaire.com Tue Nov 2 18:25:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 21:25:34 -0500 Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: References: Message-ID: <1099448734.3266.239.camel@localhost.localdomain> On Tue, 2004-11-02 at 18:46, Krishna Kumar wrote: > I didn't fix the argument to this routine, I was trying to understand if > the idea behind this will work, hence the RFC in the subject. Sorry for > creating the confusion. I was trying to understand if what I was assuming > is right or not. 
> > The first part of the patch checks if any bit is set in the method_mask, > and if so, it means that a method (table?) is registered and hence it > returns error. Actually there might be a better way to check if the > bitmask is all-zero's and avoid the for loop, but I don't see any macros > for that, and I didn't want to use "if (method_mask)". > > The add_mad_reg_req() code is doing : > /* Finally, add in methods being registered */ > for (i = find_first_bit(mad_reg_req->method_mask,IB_MGMT_MAX_METHODS); > i < IB_MGMT_MAX_METHODS; > i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > 1+i)) { > (*method)->agent[i] = priv; > } > So the agent[0-128] is pointing to the priv when that particular bitmask > is set in the method_mask (exact same bit number is used as index in > agent). > > Do you think this model is correct ? The Get/Set/Repress functions can be > checked faster by checking if the first/next bit being set rather than > going through the entire array of 128 agents, or in one case whether the > bitmask is zero or non-zero instead of looping 128 times. If this is true, > I will recreate the patch and compile before sending in the final patch. Sorry, I misunderstood that it was an RFC. In fact what I wrote was wrong. Checking the registration request is just as good as checking the method table. Perhaps the routine is now check_method_mask rather than check_method_table. The calling parameters are straightforward to fix. So the model seems fine to me. -- Hal From halr at voltaire.com Tue Nov 2 18:29:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 02 Nov 2004 21:29:08 -0500 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <20041103001015.GA13563@cup.hp.com> References: <20041103001015.GA13563@cup.hp.com> Message-ID: <1099448948.3266.248.camel@localhost.localdomain> On Tue, 2004-11-02 at 19:10, Grant Grundler wrote: > I am trying to build roland-merge #1119 on top of 2.6.10-rc1 for ia64. > And yes, the usage noted below doesn't match the declaration: > > CC [M] drivers/infiniband/core/cm_main.o > In file included from drivers/infiniband/core/cm_main.c:24: > drivers/infiniband/core/cm_priv.h: In function `ib_cm_qp_modify': > drivers/infiniband/core/cm_priv.h:183: error: too many arguments to > function `ib_modify_qp' > > Should I not (yet) be enabling CONFIG_INFINIBAND_CM? > Trivial patch appended to "fix". There was recently a verbs change which caused this, and it looks like we missed this as CM is not "used" yet. > Though I don't know if it's "right" Looks right to me. > since it seems ib_cm_qp_modify() could just go away. QP modify is needed by CM (as it walks a connection through its states it modifies the QP state) and UD QPs as well. It can't go away. > Index: src/linux-kernel/infiniband/core/cm_priv.h > =================================================================== > --- src/linux-kernel/infiniband/core/cm_priv.h (revision 1116) > +++ src/linux-kernel/infiniband/core/cm_priv.h (working copy) > @@ -178,9 +178,7 @@ > struct ib_qp_attr *attr, > int attr_mask) > { > - struct ib_qp_cap qp_cap; > - > - return qp ? ib_modify_qp(qp, attr, attr_mask, &qp_cap) : 0; > + return qp ? 
ib_modify_qp(qp, attr, attr_mask) : 0; > } > > int ib_cm_timeout_to_jiffies(int timeout); -- Hal From krkumar at us.ibm.com Tue Nov 2 18:33:10 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 18:33:10 -0800 (PST) Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: <20041102102126.26746a63.mshefty@ichips.intel.com> Message-ID: Hi Sean, I guess you meant "even if solicited is NOT set". What you described is right, the race will mean that the remove_mad_reg_req() will free things like method/class, while the find_mad_agent looks through the version and class to find the mad_agent. This patch will fix it correctly. I have also cleaned up a hack in ib_mad_recv_done_handler() where a test for '!mad_agent' was being done to determine whether to free 'recv' or not :-). Couple of issues with the new code (same as old code, though) : 1. printk(KERN_ERR PFX "No client 0x%x for received MAD " "on port %d\n", hi_tid, port_priv->port_num); and printk(KERN_NOTICE PFX "No matching mad agent found for " "received MAD on port %d\n", port_priv->port_num); both get printed when mad_agent is not found in solicited case. 2. spin_unlock is performed after all the printk's, which is a bit icky. Compile-tested patch (not tested) follows at the end of the mail. Let me know if I should fix above problems too. Thanks, - KK On Tue, 2 Nov 2004, Sean Hefty wrote: > On Tue, 2 Nov 2004 09:59:14 -0800 (PST) > Krishna Kumar wrote: > > > Hi Sean, > > > > I think that is the best approach. And using this method, we can also > > avoid holding the lock if solicited is set. I will send a patch in a > > few minutes if this approach looks good. > > Sounds good. > > I think that you'll need to hold the lock even if solicited is set to > handle the case where a response is received after the sender > unregistered. 
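The rule under discussion here — take the reference inside the same lock that the unregister path holds, so the agent cannot be freed between lookup and use — can be sketched in a few lines of stand-alone C. This is only an analogue under stated assumptions: pthread_mutex stands in for the kernel spinlock, a plain int for atomic_t, and agent_list, lookup_agent_get() and agent_put() are invented names rather than mad.c symbols.

#include <pthread.h>
#include <stdlib.h>

struct agent {
        struct agent *next;
        unsigned int hi_tid;
        int refcount;                     /* stand-in for atomic_t */
};

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static struct agent *agent_list;

/* Look an agent up by the high 32 bits of the TID and take a reference
 * before reg_lock is dropped; a concurrent unregister can then no longer
 * free the agent in the window between lookup and use. */
static struct agent *lookup_agent_get(unsigned int hi_tid)
{
        struct agent *a;

        pthread_mutex_lock(&reg_lock);
        for (a = agent_list; a; a = a->next)
                if (a->hi_tid == hi_tid) {
                        a->refcount++;
                        break;
                }
        pthread_mutex_unlock(&reg_lock);
        return a;                         /* NULL when not found */
}

/* Drop a reference; whoever drops the last one frees the agent. */
static void agent_put(struct agent *a)
{
        int last;

        pthread_mutex_lock(&reg_lock);
        last = (--a->refcount == 0);
        pthread_mutex_unlock(&reg_lock);
        if (last)
                free(a);
}

int main(void)
{
        struct agent *a = calloc(1, sizeof *a);
        struct agent *found;

        if (!a)
                return 1;
        a->hi_tid = 42;
        a->refcount = 1;                  /* registration's reference */
        agent_list = a;

        found = lookup_agent_get(42);
        if (found)
                agent_put(found);         /* done with the lookup */

        agent_list = NULL;                /* "unregister" */
        agent_put(a);                     /* drop registration's reference */
        return 0;
}

Because the increment happens before the lock is released, an unregister running concurrently and dropping its own reference leaves the lookup's reference intact; that is the window the patch below closes by moving the atomic_inc() under reg_lock inside find_mad_agent().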
> > - Sean diff -ruNp 7/mad.c 8/mad.c --- 7/mad.c 2004-11-02 16:13:05.000000000 -0800 +++ 8/mad.c 2004-11-02 18:30:19.000000000 -0800 @@ -747,13 +747,16 @@ find_mad_agent(struct ib_mad_port_privat struct ib_mad *mad, int solicited) { - struct ib_mad_agent_private *entry, *mad_agent = NULL; - struct ib_mad_mgmt_class_table *version; - struct ib_mad_mgmt_method_table *class; - u32 hi_tid; + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); /* Whether MAD was solicited determines type of routing to MAD client */ if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + /* Routing is based on high 32 bits of transaction ID of MAD */ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; list_for_each_entry(entry, &port_priv->agent_list, agent_list) { @@ -762,12 +765,14 @@ find_mad_agent(struct ib_mad_port_privat break; } } - if (!mad_agent) { + if (!mad_agent) printk(KERN_ERR PFX "No client 0x%x for received MAD " - "on port %d\n", hi_tid, port_priv->port_num); - goto out; - } + "on port %d\n", + hi_tid, port_priv->port_num); } else { + struct ib_mad_mgmt_class_table *version; + struct ib_mad_mgmt_method_table *class; + /* Routing is based on version, class, and method */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { printk(KERN_ERR PFX "MAD received with unsupported " @@ -784,23 +789,30 @@ find_mad_agent(struct ib_mad_port_privat } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; - if (!class) { + if (class) + mad_agent = class->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + else printk(KERN_ERR PFX "MAD received on port %d for class " "%d with no client\n", port_priv->port_num, mad->mad_hdr.mgmt_class); - goto out; - } - mad_agent = class->agent[mad->mad_hdr.method & - ~IB_MGMT_METHOD_RESP]; } out: - if (mad_agent && !mad_agent->agent.recv_handler) { - printk(KERN_ERR PFX "No receive handler for client " - "%p on port %d\n", - &mad_agent->agent, port_priv->port_num); - mad_agent = NULL; - } + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + mad_agent = NULL; + printk(KERN_ERR PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + } + } else + printk(KERN_NOTICE PFX "No matching mad agent found for " + "received MAD on port %d\n", port_priv->port_num); + + spin_unlock_irqrestore(&port_priv->reg_lock, flags); return mad_agent; } @@ -929,9 +941,8 @@ static void ib_mad_recv_done_handler(str struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; - struct ib_mad_agent_private *mad_agent = NULL; + struct ib_mad_agent_private *mad_agent; int solicited; - unsigned long flags; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; @@ -965,23 +976,17 @@ static void ib_mad_recv_done_handler(str recv->header.recv_buf.mad)) goto out; - spin_lock_irqsave(&port_priv->reg_lock, flags); /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); - if (!mad_agent) { - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); - } else { - atomic_inc(&mad_agent->refcount); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); + if (mad_agent) { 
ib_mad_complete_recv(mad_agent, recv, solicited); + recv = NULL; /* recv is freed up via ib_mad_complete_recv */ } out: - if (!mad_agent) { + if (recv) { /* Should this case be optimized ? */ kmem_cache_free(ib_mad_cache, recv); } From krkumar at us.ibm.com Tue Nov 2 18:35:16 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 2 Nov 2004 18:35:16 -0800 (PST) Subject: [openib-general] [RFC] [PATCH] Optimize access to method->agent using bitops In-Reply-To: <1099448734.3266.239.camel@localhost.localdomain> Message-ID: Hi Hal, Thanks for the update. I will recreate the patch and send it tomorrow. thanks, - KK On Tue, 2 Nov 2004, Hal Rosenstock wrote: > On Tue, 2004-11-02 at 18:46, Krishna Kumar wrote: > > I didn't fix the argument to this routine, I was trying to understand if > > the idea behind this will work, hence the RFC in the subject. Sorry for > > creating the confusion. I was trying to understand if what I was assuming > > is right or not. > > > > The first part of the patch checks if any bit is set in the method_mask, > > and if so, it means that a method (table?) is registered and hence it > > returns error. Actually there might be a better way to check if the > > bitmask is all-zero's and avoid the for loop, but I don't see any macros > > for that, and I didn't want to use "if (method_mask)". > > > > The add_mad_reg_req() code is doing : > > /* Finally, add in methods being registered */ > > for (i = find_first_bit(mad_reg_req->method_mask,IB_MGMT_MAX_METHODS); > > i < IB_MGMT_MAX_METHODS; > > i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, > > 1+i)) { > > (*method)->agent[i] = priv; > > } > > So the agent[0-128] is pointing to the priv when that particular bitmask > > is set in the method_mask (exact same bit number is used as index in > > agent). > > > > Do you think this model is correct ? The Get/Set/Repress functions can be > > checked faster by checking if the first/next bit being set rather than > > going through the entire array of 128 agents, or in one case whether the > > bitmask is zero or non-zero instead of looping 128 times. If this is true, > > I will recreate the patch and compile before sending in the final patch. > > Sorry I misunderstood that it was RFC. In fact what I wrote was wrong. > Checking the registration request is just as good as checking the method > table. Perhaps the routine is now check_method_mask rather than > check_method_table. The calling parameters are straightforward to fix. > So the model seems fine to me. > > -- Hal > > > From mashirle at us.ibm.com Tue Nov 2 20:00:09 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Tue, 2 Nov 2004 20:00:09 -0800 Subject: [openib-general] [PATCH] fix memory leak problem in agent_mad_send() Message-ID: <200411022000.09120.mashirle@us.ibm.com> Here is the patch. Please review it. 
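The leaks plugged below all have the same shape: agent_mad_send() acquires resources one after another, and an early return forgets what is already held. The patch fixes this by freeing explicitly at each exit; the common alternative, sketched stand-alone below with invented names (make_packet() and friends, not the actual agent.c code), is a single goto unwind ladder so each failure releases exactly what was acquired before it.

#include <stdlib.h>
#include <string.h>

struct packet {
        char *buf;                /* first acquisition */
        char *wr;                 /* second acquisition */
};

/* Allocate two resources; on any failure unwind, in reverse order, exactly
 * what has already been acquired, so no early exit can leak. */
static struct packet *make_packet(size_t len)
{
        struct packet *p;

        p = malloc(sizeof *p);
        if (!p)
                goto err;

        p->buf = malloc(len);
        if (!p->buf)
                goto err_free_p;

        p->wr = malloc(len);
        if (!p->wr)
                goto err_free_buf;

        memset(p->buf, 0, len);
        memset(p->wr, 0, len);
        return p;

err_free_buf:
        free(p->buf);
err_free_p:
        free(p);
err:
        return NULL;
}

int main(void)
{
        struct packet *p = make_packet(256);

        if (p) {
                free(p->wr);
                free(p->buf);
                free(p);
        }
        return 0;
}

Each new acquisition only adds one label to the ladder, and every earlier resource is released no matter where the failure happens, which is why kernel code generally favors this style over per-exit kfree() calls as setup paths grow.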
diff -urN access/agent.c access.patch5/agent.c --- access/agent.c 2004-11-02 17:40:06.000000000 -0800 +++ access.patch5/agent.c 2004-11-02 18:43:47.534608536 -0800 @@ -357,12 +357,16 @@ if (!port_priv) { printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", mad_agent); + kfree(mad); return; } agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); - if (!agent_send_wr) + if (!agent_send_wr) { + printk(KERN_ERR SPFX "No memory for agent work request\n"); + kfree(mad); return; + } agent_send_wr->mad = mad; /* PCI mapping */ @@ -407,6 +411,7 @@ if (IS_ERR(agent_send_wr->ah)) { printk(KERN_ERR SPFX "No memory for address handle\n"); kfree(mad); + kfree(agent_send_wr); return; } @@ -432,6 +437,8 @@ sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); + kfree(mad); + kfree(agent_send_wr); } else { list_add_tail(&agent_send_wr->send_list, &port_priv->send_posted_list); -- Thanks Shirley Ma IBM Linux Technology Center From roland at topspin.com Tue Nov 2 20:06:47 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 20:06:47 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <20041103001015.GA13563@cup.hp.com> (Grant Grundler's message of "Tue, 2 Nov 2004 16:10:15 -0800") References: <20041103001015.GA13563@cup.hp.com> Message-ID: <523bzrshig.fsf@topspin.com> Grant> Roland, I am trying to build roland-merge #1119 on top of Grant> 2.6.10-rc1 for ia64. And yes, the usage noted below Grant> doesn't match the declaration: Grant> Should I not (yet) be enabling CONFIG_INFINIBAND_CM? Grant> Trivial patch appended to "fix". Though I don't know if Grant> it's "right" since it seems ib_cm_qp_modify() could just go Grant> away. CONFIG_INFINIBAND_CM depends on CONFIG_BROKEN now, so I wouldn't expect it to build (it needs to be converted to the new MAD API). So for now don't try to build it. Your patch is a small step in the right direction so I applied it. The reason I have the ib_cm_qp_modify() function is to add the check for a NULL qp, which makes the CM logic simpler. - R. From roland at topspin.com Tue Nov 2 22:27:48 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 02 Nov 2004 22:27:48 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access Message-ID: <52y8hjqwez.fsf@topspin.com> I've just checked in an initial version of userspace MAD access (including documentation in docs/user_mad.txt). Unfortunately this is not quite ready for use underneath OpenSM, since it is not possible to register an agent for the SM classes (since they are currently grabbed by the kernel SMA first). All criticisms and comments greatly appreciated... Thanks, Roland Index: infiniband/include/ib_user_mad.h =================================================================== --- infiniband/include/ib_user_mad.h (revision 0) +++ infiniband/include/ib_user_mad.h (revision 0) @@ -0,0 +1,97 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; +}; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 0, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 1, __u32) + +#endif /* IB_USER_MAD_H */ Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 1086) +++ infiniband/core/Makefile (working copy) @@ -10,7 +10,8 @@ obj-$(CONFIG_INFINIBAND) += \ ib_core.o \ ib_mad.o \ - ib_sa.o + ib_sa.o \ + ib_umad.o obj-$(CONFIG_INFINIBAND_CM) += \ ib_cm.o @@ -36,6 +37,8 @@ ib_sa-objs := sa_query.o +ib_umad-objs := user_mad.o + ib_cm-objs := \ cm_main.o \ cm_api.o \ Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 0) +++ infiniband/core/user_mad.c (revision 0) @@ -0,0 +1,639 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device *class_dev; + struct ib_device *ib_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + struct semaphore mutex; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock = SPIN_LOCK_UNLOCKED; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static struct class_simple *umad_class; + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_umad_packet *packet = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + PCI_DMA_TODEVICE); + ib_destroy_ah(packet->ah); + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + down(&file->mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + list_add_tail(&packet->list, 
&file->recv_list); + wake_up_interruptible(&file->recv_wait); + goto agent; + } + + kfree(packet); + +agent: + up(&file->mutex); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + if (down_interruptible(&file->mutex)) + return -ERESTARTSYS; + + while (list_empty(&file->recv_list)) { + up(&file->mutex); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + if (down_interruptible(&file->mutex)) + return -ERESTARTSYS; + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + up(&file->mutex); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + if (down_interruptible(&file->mutex)) { + ret = -ERESTARTSYS; + goto err; + } + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = pci_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + PCI_DMA_TODEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + PCI_DMA_TODEVICE); + goto err_up; + } + + up(&file->mutex); + + return sizeof packet->mad; + +err_up: + up(&file->mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = 
filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + if (down_interruptible(&file->mutex)) + return -EINTR; + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up(&file->mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + if (down_interruptible(&file->mutex)) + return -EINTR; + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up(&file->mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + init_MUTEX(&file->mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = 
ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + struct ib_device_attr attr; + if (ib_query_device(device, &attr)) + return; + + s = 1; + e = attr.phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev = + class_simple_device_add(umad_class, + umad_dev->port[i - s].dev.dev, + &device->dma_device->dev, + "umad%d", umad_dev->port[i - s].devnum); + if (IS_ERR(umad_dev->port[i - s].class_dev)) + goto err_class; + + class_set_devdata(umad_dev->port[i - s].class_dev, + &umad_dev->port[i - s]); + + class_device_create_file(umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev); + class_device_create_file(umad_dev->port[i - s].class_dev, + &class_device_attr_port); + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) { + class_simple_device_remove(umad_dev->port[i - s].dev.dev); + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + } + + kfree(umad_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { + class_simple_device_remove(umad_dev->port[i].dev.dev); + cdev_del(&umad_dev->port[i].dev); + clear_bit(umad_dev->port[i].devnum, dev_map); + } + + kfree(umad_dev); +} + +static int ib_umad_hotplug(struct class_device *dev, char **envp, + int num_envp, char *buffer, int buffer_size) +{ + return 0; +} + +static int __init ib_umad_init(void) +{ + int ret; + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + umad_class = class_simple_create(THIS_MODULE, "infiniband_mad"); + if 
(IS_ERR(umad_class)) { + printk(KERN_ERR "user_mad: couldn't create class_simple\n"); + ret = PTR_ERR(umad_class); + goto out_chrdev; + } + + ret = class_simple_set_hotplug(umad_class, ib_umad_hotplug); + if (ret) { + printk(KERN_ERR "user_mad: couldn't set class_simple hotplug\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + return 0; + +out_class: + class_simple_destroy(umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + ib_unregister_client(&umad_client); + class_simple_destroy(umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); Index: docs/user_mad.txt =================================================================== --- docs/user_mad.txt (revision 0) +++ docs/user_mad.txt (revision 0) @@ -0,0 +1,70 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" + + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 + for port 1 of device mthca0. 
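Taken together, the register/read/write snippets in docs/user_mad.txt add up to a small complete client. The following is a sketch only, not part of the checkin: it assumes the struct ib_user_mad, struct ib_user_mad_reg_req and ioctl definitions from the new header are available as "ib_user_mad.h" (the include path is hypothetical), uses the /dev/infiniband/mthca0/umad1 node from the udev example, and sends to an arbitrary example LID using the PerfMgmt class (0x04).

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <arpa/inet.h>
	#include "ib_user_mad.h"	/* hypothetical include path */

	int main(void)
	{
		struct ib_user_mad_reg_req req;
		struct ib_user_mad mad;
		int fd = open("/dev/infiniband/mthca0/umad1", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		memset(&req, 0, sizeof req);
		req.qpn                = 1;	/* GSI; 0 would select the SMI QP */
		req.mgmt_class         = 0x04;	/* e.g. performance management */
		req.mgmt_class_version = 1;
		/* method_mask left zero: we only expect replies to our own sends */

		if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req)) {
			perror("agent register");
			return 1;
		}

		memset(&mad, 0, sizeof mad);
		mad.id   = req.id;		/* agent id filled in by the ioctl */
		mad.lid  = htons(4);		/* example destination LID */
		mad.qpn  = htonl(1);		/* remote GSI QP */
		mad.qkey = htonl(0x80010000);	/* well-known GSI QKey */
		/* ... fill in mad.data with the request MAD ... */

		if (write(fd, &mad, sizeof mad) != sizeof mad)
			perror("write");
		else if (read(fd, &mad, sizeof mad) != sizeof mad)
			perror("read");
		else
			printf("reply from LID 0x%x\n", ntohs(mad.lid));

		close(fd);
		return 0;
	}

Note that read() blocks until a MAD arrives unless the descriptor is opened with O_NONBLOCK; poll() can be used to wait instead, and since ib_umad_poll above always reports the device writable, only POLLIN is worth waiting for.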
From halr at voltaire.com Wed Nov 3 06:01:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:01:49 -0500 Subject: [openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad In-Reply-To: References: Message-ID: <1099490509.2831.19.camel@hpc-1> On Tue, 2004-11-02 at 21:33, Krishna Kumar wrote: > Hi Sean, > > I guess you meant "even if solicited is NOT set". What you described is > right, the race will mean that the remove_mad_reg_req() will free things > like method/class, while the find_mad_agent looks through the version > and class to find the mad_agent. This patch will fix it correctly. > > I have also cleaned up a hack in ib_mad_recv_done_handler() where a > test for '!mad_agent' was being done to determine whether to free 'recv' > or not :-). Thanks. Applied. -- Hal From halr at voltaire.com Wed Nov 3 06:06:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:06:28 -0500 Subject: [openib-general] [PATCH] fix memory leak problem in agent_mad_send() In-Reply-To: <200411022000.09120.mashirle@us.ibm.com> References: <200411022000.09120.mashirle@us.ibm.com> Message-ID: <1099490788.2831.26.camel@hpc-1> On Tue, 2004-11-02 at 23:00, Shirley Ma wrote: > Here is the patch. Please review it. Yes, these are memory leaks in agent_mad_send, but it would be better to fix them with the deallocation done at the same function level where the allocation is done. Hence, rather than agent_mad_send returning void, it should return int, etc. I will post a patch for this shortly. -- Hal > > diff -urN access/agent.c access.patch5/agent.c > --- access/agent.c 2004-11-02 17:40:06.000000000 -0800 > +++ access.patch5/agent.c 2004-11-02 18:43:47.534608536 -0800 > @@ -357,12 +357,16 @@ > if (!port_priv) { > printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", > mad_agent); > + kfree(mad); > return; > } > > agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); > - if (!agent_send_wr) > + if (!agent_send_wr) { > + printk(KERN_ERR SPFX "No memory for agent work request\n"); > + kfree(mad); > return; > + } > agent_send_wr->mad = mad; > > /* PCI mapping */ > @@ -407,6 +411,7 @@ > if (IS_ERR(agent_send_wr->ah)) { > printk(KERN_ERR SPFX "No memory for address handle\n"); > kfree(mad); > + kfree(agent_send_wr); > return; > } > > @@ -432,6 +437,8 @@ > sizeof(struct ib_mad), > PCI_DMA_TODEVICE); > ib_destroy_ah(agent_send_wr->ah); > + kfree(mad); > + kfree(agent_send_wr); > } else { > list_add_tail(&agent_send_wr->send_list, > &port_priv->send_posted_list); > > From halr at voltaire.com Wed Nov 3 06:22:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:22:11 -0500 Subject: [openib-general] mthca dma_pool_destroy mthca_av busy message Message-ID: <1099491731.2831.45.camel@hpc-1> Hi Roland, When shutting down mthca after shutting down IPoIB, the following message appears on the console: ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, c03a6000 busy Is everything OK the next time mthca, etc. is started? (It appears to be.) Thanks.
-- Hal From halr at voltaire.com Wed Nov 3 06:25:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 09:25:20 -0500 Subject: [openib-general] [PATCH] agent: Fix memory leaks associated with agent_mad_send errors Message-ID: <1099491920.2831.50.camel@hpc-1> agent: Fix memory leaks (identified by Shirley Ma) associated with agent_mad_send errors Index: agent.c =================================================================== --- agent.c (revision 1104) +++ agent.c (working copy) @@ -339,10 +339,10 @@ return entry; } -void agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) +int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_mad *mad, + struct ib_grh *grh, + struct ib_mad_recv_wc *mad_recv_wc) { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; @@ -351,18 +351,19 @@ struct ib_send_wr *bad_send_wr; struct ib_ah_attr ah_attr; unsigned long flags; + int ret = 1; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", mad_agent); - return; + goto out; } agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); if (!agent_send_wr) - return; + goto out; agent_send_wr->mad = mad; /* PCI mapping */ @@ -406,8 +407,8 @@ agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); if (IS_ERR(agent_send_wr->ah)) { printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(mad); - return; + kfree(agent_send_wr); + goto out; } send_wr.wr.ud.ah = agent_send_wr->ah; @@ -432,11 +433,16 @@ sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); } else { list_add_tail(&agent_send_wr->send_list, &port_priv->send_posted_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; } + +out: + return ret; } int smi_send_smp(struct ib_mad_agent *mad_agent, @@ -470,8 +476,9 @@ kfree(smp_response); return 0; } - agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc); + if (agent_mad_send(mad_agent, smp_response, + NULL, mad_recv_wc)) + kfree(smp_response); } else kfree(smp_response); return 1; From roland at topspin.com Wed Nov 3 07:06:20 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 07:06:20 -0800 Subject: [openib-general] Re: mthca dma_pool_destroy mthca_av busy message In-Reply-To: <1099491731.2831.45.camel@hpc-1> (Hal Rosenstock's message of "Wed, 03 Nov 2004 09:22:11 -0500") References: <1099491731.2831.45.camel@hpc-1> Message-ID: <52u0s7q8er.fsf@topspin.com> Hal> Hi Roland, When shutting down mthca after shutting down Hal> IPoIB, the following message appears on the console: Hal> ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, c03a6000 busy Yes, this is because IPoIB currently leaks AVs (we need to hook into the neighbour destructor to know when we can destroy an AV). Hal> Is everything OK the next time mthca, etc. is started ? (It Hal> appears to be). I think so. - Roland From sean.hefty at intel.com Wed Nov 3 08:34:29 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 08:34:29 -0800 Subject: [openib-general] [PATCH] Missing check for atomic_dec inib_post_send_mad In-Reply-To: Message-ID: >Couple of issues with the new code (same as old code, though) : > >1. 
printk(KERN_ERR PFX "No client 0x%x for received MAD " > "on port %d\n", > hi_tid, port_priv->port_num); > and printk(KERN_NOTICE PFX "No matching mad agent found for " > "received MAD on port %d\n", port_priv->port_num); > both get printed when mad_agent is not found in solicited case. > >2. spin_unlock is performed after all the printk's, which is a bit icky. > >Compile-tested patch (not tested) follows at the end of the mail. Let me >know if I should fix above problems too. Thanks for the patch. If you can do something with the printk's, that would be good. They should be KERN_NOTICE, but we may want to consider just removing them. - Sean From sean.hefty at intel.com Wed Nov 3 09:00:23 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 09:00:23 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52y8hjqwez.fsf@topspin.com> Message-ID: >I've just checked in an initial version of userspace MAD access >(including documentation in docs/user_mad.txt). > >Unfortunately this is not quite ready for use underneath OpenSM, since >it is not possible to register an agent for the SM classes (since they >are currently grabbed by the kernel SMA first). > >All criticisms and comments greatly appreciated... After a first review, the code looks really good. Is anyone willing to work on porting opensm to this? If not, I can start on this. Otherwise, I will continue working on adding MAD error/overrun handling. - Sean From roland at topspin.com Wed Nov 3 09:14:26 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 09:14:26 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: (Sean Hefty's message of "Wed, 3 Nov 2004 09:00:23 -0800") References: Message-ID: <52pt2urh1p.fsf@topspin.com> Sean> Is anyone willing to work on porting opensm to this? If Sean> not, I can start on this. Otherwise, I will continue Sean> working on adding MAD error/overrun handling. It would be great to work on that but we need to resolve how to handle the SM classes first. One option would be to extend the user_mad code to handle MADs timeouts and have OpenSM only receive solicited MADs (and have OpenSM register an agent with class == 0). This is only a temporary solution because ultimately OpenSM needs to receive SMInfo SMPs. I think this still requires some figuring for how to handle the DR SMI. Another option is to revise the kernel MAD code so that it does not need to register an agent for the SM classes (ie pass all MADs to low-level driver first). - R. From sean.hefty at intel.com Wed Nov 3 09:24:09 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 09:24:09 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52pt2urh1p.fsf@topspin.com> Message-ID: >Another option is to revise the kernel MAD code so that it does not >need to register an agent for the SM classes (ie pass all MADs to >low-level driver first). I thought that we had decided to go this route, and replace snoop_mad with calls to process_mad. If we're in agreement on this, I can do it first. 
- Sean From roland at topspin.com Wed Nov 3 09:31:24 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 09:31:24 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: (Sean Hefty's message of "Wed, 3 Nov 2004 09:24:09 -0800") References: Message-ID: <52lldirg9f.fsf@topspin.com> Sean> I thought that we had decided to go this route, and replace Sean> snoop_mad with calls to process_mad. If we're in agreement Sean> on this, I can do it first. That was my impression too, so I think that would be a good route to go. - R. From halr at voltaire.com Wed Nov 3 10:27:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 13:27:20 -0500 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: References: Message-ID: <1099506440.2831.60.camel@hpc-1> On Wed, 2004-11-03 at 12:00, Sean Hefty wrote: > Is anyone willing to work on porting opensm to this? If not, > I can start on this. Otherwise, I will continue working on > adding MAD error/overrun handling. Shahar from Voltaire will be doing this. I am working now on modifying the MAD layer and agents to make the changes that have been discussed on the list relative to supporting the SM. -- Hal From sean.hefty at intel.com Wed Nov 3 10:24:57 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 3 Nov 2004 10:24:57 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <1099506440.2831.60.camel@hpc-1> Message-ID: >On Wed, 2004-11-03 at 12:00, Sean Hefty wrote: >> Is anyone willing to work on porting opensm to this? If not, >> I can start on this. Otherwise, I will continue working on >> adding MAD error/overrun handling. > >Shahar from Voltaire will be doing this. I am working now on modifying >the MAD layer and agents to make the changes that have been discussed on >the list relative to supporting the SM. Then I shall return to my work of handling QP errors/overruns in the MAD layer. - Sean From halr at voltaire.com Wed Nov 3 10:32:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 13:32:49 -0500 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52pt2urh1p.fsf@topspin.com> References: <52pt2urh1p.fsf@topspin.com> Message-ID: <1099506769.2831.66.camel@hpc-1> On Wed, 2004-11-03 at 12:14, Roland Dreier wrote: > Another option is to revise the kernel MAD code so that it does not > need to register an agent for the SM classes (ie pass all MADs to > low-level driver first). That's what I'm working on now (eliminate snoop_mad and replace with process_mad). I should have the first cut by COB today. After that, I will work on SMI restructure that needs to be done so that outgoing SMI updating becomes part of ib_post_send_mad rather than a precursor to it. -- Hal From iod00d at hp.com Wed Nov 3 10:50:17 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Nov 2004 10:50:17 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <523bzrshig.fsf@topspin.com> References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> Message-ID: <20041103185017.GE17281@cup.hp.com> On Tue, Nov 02, 2004 at 08:06:47PM -0800, Roland Dreier wrote: > CONFIG_INFINIBAND_CM depends on CONFIG_BROKEN now, Sorry - I didn't see that. I only looked at the Makefile, not Kconfig. > Your patch is a small step in the right direction so I applied it. 
> The reason I have the ib_cm_qp_modify() function is to add the check > for a NULL qp, which makes the CM logic simpler. *nod*. I'll poke at other trivial things in the meantime... thanks, grant From krkumar at us.ibm.com Wed Nov 3 10:45:20 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 10:45:20 -0800 (PST) Subject: [openib-general] [PATCH] Cleanup spaces to tabs Message-ID: Entire openib cleaned up to remove 8 spaces to replace with tabs, just two files though :-) thx, - KK diff -ruN 1/agent.c 2/agent.c --- 1/agent.c 2004-11-03 10:42:29.000000000 -0800 +++ 2/agent.c 2004-11-03 10:43:24.000000000 -0800 @@ -207,7 +207,7 @@ if (hop_ptr == 1) { if (smp->dr_slid == IB_LID_PERMISSIVE) { /* giving SMP to SM - update hop_ptr */ - smp->hop_ptr--; + smp->hop_ptr--; return 1; } /* smp->hop_ptr updated when sending */ @@ -373,7 +373,7 @@ PCI_DMA_TODEVICE); gather_list.length = sizeof(struct ib_mad); gather_list.lkey = (*port_priv->mr).lkey; - + send_wr.next = NULL; send_wr.opcode = IB_WR_SEND; send_wr.sg_list = &gather_list; @@ -381,7 +381,7 @@ send_wr.wr.ud.remote_qpn = mad_recv_wc->wc->src_qp; /* DQPN */ send_wr.wr.ud.timeout_ms = 0; send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - + ah_attr.dlid = mad_recv_wc->wc->slid; ah_attr.port_num = mad_agent->port_num; ah_attr.src_path_bits = mad_recv_wc->wc->dlid_path_bits; @@ -410,7 +410,7 @@ kfree(agent_send_wr); goto out; } - + send_wr.wr.ud.ah = agent_send_wr->ah; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = mad_recv_wc->wc->pkey_index; @@ -560,8 +560,8 @@ { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; - struct list_head *send_wr; - unsigned long flags; + struct list_head *send_wr; + unsigned long flags; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); @@ -579,7 +579,7 @@ "is empty\n", (unsigned long long) mad_send_wc->wr_id); return; } - + agent_send_wr = list_entry(&port_priv->send_posted_list, struct ib_agent_send_wr, send_list); @@ -588,8 +588,8 @@ send_list); /* Remove from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, @@ -694,11 +694,11 @@ goto error3; } - /* Obtain MAD agent for PerfMgmt class */ - reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; clear_bit(IB_MGMT_METHOD_TRAP_REPRESS, (unsigned long *)®_req.method_mask); - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, ®_req, 0, &agent_send_handler, @@ -756,7 +756,7 @@ ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->lr_smp_agent); ib_unregister_mad_agent(port_priv->dr_smp_agent); - kfree(port_priv); + kfree(port_priv); return 0; } diff -ruN 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 10:42:29.000000000 -0800 +++ 2/mad.c 2004-11-03 10:43:12.000000000 -0800 @@ -1473,7 +1473,7 @@ struct ib_qp_attr *attr; int attr_mask; - attr = kmalloc(sizeof *attr, GFP_KERNEL); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); return -ENOMEM; From mashirle at us.ibm.com Wed Nov 3 10:56:29 2004 From: mashirle at us.ibm.com (Shirley Ma) 
Date: Wed, 3 Nov 2004 10:56:29 -0800 Subject: [openib-general] [PATCH] fix memory leak and return value associated with agent_mad_send(response) Message-ID: <200411031056.29522.mashirle@us.ibm.com> Here is the patch. Please review it. diff -urN access/agent.c access.patch6/agent.c --- access/agent.c 2004-11-03 10:34:17.941019320 -0800 +++ access.patch6/agent.c 2004-11-03 10:54:16.001886384 -0800 @@ -477,8 +477,10 @@ return 0; } if (agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc)) + NULL, mad_recv_wc)) { kfree(smp_response); + return 0; + } } else kfree(smp_response); return 1; @@ -504,7 +506,10 @@ ret = mad_process_local(mad_agent, mad, response, slid); if (ret & IB_MAD_RESULT_SUCCESS) { grh = (void *)mad - sizeof(struct ib_grh); - agent_mad_send(mad_agent, response, grh, mad_recv_wc); + if (agent_mad_send(mad_agent, response, grh, mad_recv_wc)) { + kfree(response); + return 0; + } } else kfree(response); return 1; @@ -543,12 +548,12 @@ } else { /* PerfMgmt class */ if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - agent_mad_response(mad_agent, mad, mad_recv_wc, - mad_recv_wc->wc->slid); + return (agent_mad_response(mad_agent, mad, mad_recv_wc, + mad_recv_wc->wc->slid)); } else { printk(KERN_ERR "agent_recv_mad: Unexpected mgmt class 0x%x received\n", mad->mad_hdr.mgmt_class); + return 0; } - return 0; } /* Complete receive up stack */ -- Thanks Shirley Ma IBM Linux Technology Center From iod00d at hp.com Wed Nov 3 11:21:03 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Nov 2004 11:21:03 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <523bzrshig.fsf@topspin.com> References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> Message-ID: <20041103192103.GF17281@cup.hp.com> On Tue, Nov 02, 2004 at 08:06:47PM -0800, Roland Dreier wrote: > CONFIG_INFINIBAND_CM depends on CONFIG_BROKEN now, ... > Your patch is a small step in the right direction so I applied it. "small" is a very generous assessment :^) It was almost irrelevant given how much code still needs work. Here's the link-phase output with CM/DM/SRP/etc. enabled: Building modules, stage 2. MODPOST *** Warning: "ib_client_query_cancel" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "tsIbSetOutofServiceNoticeHandler" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "tsIbPathRecordRequest" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "tsIbSetInServiceNoticeHandler" [drivers/infiniband/ulp/srp/ib_srp.ko] undefined! *** Warning: "ib_client_mad_handler_register" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "tsIbPortInfoTblQuery" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "tsIbPortInfoQuery" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "ib_client_query" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "ib_client_alloc_tid" [drivers/infiniband/core/ib_dm_client.ko] undefined! *** Warning: "ib_mad_send" [drivers/infiniband/core/ib_cm.ko] undefined! *** Warning: "ib_mad_handler_register" [drivers/infiniband/core/ib_cm.ko] undefined! *** Warning: "ib_mad_handler_deregister" [drivers/infiniband/core/ib_cm.ko] undefined! Can folks offer some guidance on the following issues: 1) drivers/infiniband/include/ still has a lot of files prefixed with "ts". Do they all need to be renamed? Or do some need to be reworked to match some new interfaces? E.g. ib_client_query_cancel is declared in ts_ib_client_query.h.
I don't know if ts_ib_client_query.h needs additional work. Should I submit patches so all #include's only reference ib_client_query.h? Or maybe just client_query.h? 2) Can some of the offending include files be dropped outright? 3) Of the ts* symbols above, can someone point me at which header file contains the "right" interfaces to use? I might be able to fixup some of the warnings above. I'm thinking of tsIbSetOutofServiceNoticeHandler and similar functions. grant From roland at topspin.com Wed Nov 3 11:24:02 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 11:24:02 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <20041103192103.GF17281@cup.hp.com> (Grant Grundler's message of "Wed, 3 Nov 2004 11:21:03 -0800") References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> <20041103192103.GF17281@cup.hp.com> Message-ID: <528y9irb1p.fsf@topspin.com> Grant> 1) drivers/infiniband/include/ still has alot of files Grant> still prefixed with "ts". Do they all need to be renamed? Grant> Or do some need to be reworked to match some new Grant> interfaces? I think pretty much every ts_*.h file is obsolete. When we port the CM to the new world, we'll have to update ts_ib_cm.h but it needs quite a bit of work. We're going to create an actual gen2/trunk very soon with the first set of stuff for kernel submission, and I'll remove all the broken stuff from there (I'll keep the CM etc on my branch so that it can eventually be fixed and added to the trunk). - R. From roland at topspin.com Wed Nov 3 11:56:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 11:56:14 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <52y8hjqwez.fsf@topspin.com> (Roland Dreier's message of "Tue, 02 Nov 2004 22:27:48 -0800") References: <52y8hjqwez.fsf@topspin.com> Message-ID: <524qk6r9k1.fsf@topspin.com> By the way, buried down at the end of the patch is some documentation about creating device files: +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" + + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 + for port 1 of device mthca0. Do the names /dev/infiniband/mthca0/umad1 and so on make sense to people? I thought that userspace verbs support would probably use a file like /dev/infiniband/mthca0/verbs, etc. In any case, now is probably the time to object before we have legacy issues to worry about.... - R. From iod00d at hp.com Wed Nov 3 12:07:24 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Nov 2004 12:07:24 -0800 Subject: [openib-general] ib_modify_qp() too many arguments In-Reply-To: <528y9irb1p.fsf@topspin.com> References: <20041103001015.GA13563@cup.hp.com> <523bzrshig.fsf@topspin.com> <20041103192103.GF17281@cup.hp.com> <528y9irb1p.fsf@topspin.com> Message-ID: <20041103200724.GG17281@cup.hp.com> On Wed, Nov 03, 2004 at 11:24:02AM -0800, Roland Dreier wrote: > Grant> 1) drivers/infiniband/include/ still has alot of files > Grant> still prefixed with "ts". Do they all need to be renamed? > Grant> Or do some need to be reworked to match some new > Grant> interfaces? > > I think pretty much every ts_*.h file is obsolete. When we port the > CM to the new world, we'll have to update ts_ib_cm.h but it needs > quite a bit of work. 
ok > We're going to create an actual gen2/trunk very soon with the first > set of stuff for kernel submission, and I'll remove all the broken > stuff from there (I'll keep the CM etc on my branch so that it can > eventually be fixed and added to the trunk). I don't mind broken stuff in the candidate branch. Especially if it's something that we need anyway and just needs "the dots connected". Fixing missing symbols is usually a pretty mundane task. It goes something along the lines of: 1) look for a similar new symbol 2) find the old symbol in the old code 3) compare the two and see how they differ. 4) either use the new symbol and adjust the code around the reference, drop the reference to the old symbol, or ask what to do. thanks, grant From krkumar at us.ibm.com Wed Nov 3 13:23:32 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 13:23:32 -0800 (PST) Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ Message-ID: Hi, I am not sure if this is a good idea, but since I am new to this area, here it goes :-) Section 11.2.6.3, C11-16 states that resize of the CQ must be permitted. In the patch I am submitting, I don't understand why so many parameters are expected by driver/verbs. I thought the qp_handle and ib_qp_attr are enough, at least according to the spec. Along with this, I am going to submit another patch to catch "catastrophic" errors in the return value of the resize operation. This is due to the need to check for two special cases: "CQ overrun" and "CQ inaccessible". For these two errors, I think the queues should be deallocated and an error returned. This is in the second patch. I am not sure of the error numbers; I guessed them from mthca_eq.c and could be wrong here. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 11:32:14.000000000 -0800 +++ 2/mad.c 2004-11-03 13:15:49.000000000 -0800 @@ -1629,6 +1629,14 @@ static void init_mad_queue(struct ib_mad INIT_LIST_HEAD(&mad_queue->list); } +/* + * Allocate one mad QP. + * + * If the return indicates success, the value returned is the new size + * of the queue pair that got created. + * + * Return > 0 on success and -(ERRNO) on failure. Zero should never happen. + */ static int create_mad_qp(struct ib_mad_port_private *port_priv, struct ib_mad_qp_info *qp_info, enum ib_qp_type qp_type) @@ -1652,15 +1660,23 @@ qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(qp_info->qp)) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", - get_spl_qp_index(qp_type)); + if (!IS_ERR(qp_info->qp)) { + struct ib_qp_attr qp_attr; + + ret = ib_query_qp(qp_info->qp, &qp_attr, 0, &qp_init_attr); + if (ret < 0) { + /* + * For any error, use the same size we used to
+ */ + ret = qp_init_attr.cap.max_send_wr + + qp_init_attr.cap.max_recv_wr; + } + } else { ret = PTR_ERR(qp_info->qp); - goto error; + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d err:%d\n", + get_spl_qp_index(qp_type), ret); } - return 0; - -error: return ret; } @@ -1682,6 +1698,7 @@ static int ib_mad_port_open(struct ib_de .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_mad_port_private *port_priv; + int total_qp_size; unsigned long flags; /* First, check if port already open at MAD layer */ @@ -1731,11 +1748,25 @@ static int ib_mad_port_open(struct ib_de } ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); - if (ret) + if (ret <= 0) goto error6; + total_qp_size = ret; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); - if (ret) + if (ret <= 0) goto error7; + total_qp_size += ret; + + /* Resize if the total QP[0,1] size is greater than CQ size. */ + if (total_qp_size > cq_size) { + printk(KERN_DEBUG PFX "ib_mad_port_open: increasing size of " + "CQ from %d to %d\n", cq_size, total_qp_size); + if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { + printk(KERN_DEBUG PFX "Couldn't increase CQ size - " + "err:%d\n", ret); + /* continue, not an error */ + } + } spin_lock_init(&port_priv->reg_lock); INIT_LIST_HEAD(&port_priv->agent_list); From krkumar at us.ibm.com Wed Nov 3 13:24:34 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 13:24:34 -0800 (PST) Subject: [openib-general] [PATCH 2/2] [RFC] Implement error handling in resize of CQ In-Reply-To: Message-ID: Again, this has been build-tested only. Thx, - KK diff -ruNp 2/mad.c 3/mad.c --- 2/mad.c 2004-11-03 13:15:49.000000000 -0800 +++ 3/mad.c 2004-11-03 13:16:47.000000000 -0800 @@ -1686,6 +1686,23 @@ static void destroy_mad_qp(struct ib_mad } /* + * Overrun and Inaccessible errors cannot be handled by QP resize operation. + */ +static inline int is_catastrophic_error(int err) +{ +#define CQ_OVERFLOW_ERROR 0x0f +#define CQ_ACCESS_ERROR 0x11 + + switch (err) { + default: /* OK */ + return 0; + case CQ_ACCESS_ERROR: + case CQ_OVERFLOW_ERROR: + return 1; + } +} + +/* * Open the port * Create the QP, PD, MR, and CQ if needed */ @@ -1764,6 +1781,10 @@ static int ib_mad_port_open(struct ib_de if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { printk(KERN_DEBUG PFX "Couldn't increase CQ size - " "err:%d\n", ret); + if (is_catastrophic_error(ret)) { + /* Clean up qp_info[0,1] */ + goto error8; + } /* continue, not an error */ } } From halr at voltaire.com Wed Nov 3 13:54:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 16:54:34 -0500 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) Message-ID: <1099518873.2837.5.camel@hpc-1> mthca/mad/agent changes to eliminate snoop_mad and use process_mad to give mthca driver first crack at received MAD and change agents to be send only (and not register for receives which are now handled by mthca). There are some more optimizations that can be made but this is a working start. I will do some of this hopefully tomorrow but outgoing SMI is my primary goal (to incorporate it into ib_post_send_mad). Once that is done, the changes to the MAD layer for SM support are usable AFAIK. 
Index: openib-candidate/src/linux-kernel/infiniband/access/agent.c =================================================================== --- openib-candidate/src/linux-kernel/infiniband/access/agent.c (revision 1125) +++ openib-candidate/src/linux-kernel/infiniband/access/agent.c (working copy) @@ -29,7 +29,6 @@ #include - static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); @@ -37,9 +36,9 @@ * Fixup a directed route SMP for sending. Return 0 if the SMP should be * discarded. */ -static int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) { u8 hop_ptr, hop_cnt; @@ -111,23 +110,6 @@ } /* - * Sender side handling of outgoing SMPs. Fixup the SMP as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_send(smp, node_type, port_num); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the SMP should be handled by the local SMA via process_mad. */ static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, @@ -145,10 +127,10 @@ * Adjust information for a received SMP. Return 0 if the SMP should be * dropped. */ -static int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) { u8 hop_ptr, hop_cnt; @@ -221,29 +203,10 @@ } /* - * Receive side handling SMPs. Save receive information as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_recv(smp, node_type, - port_num, phys_port_cnt); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the received DR SMP should be forwarded to the send queue. * Return 0 if the SMP should be completed up the stack. */ -static int smi_check_forward_dr_smp(struct ib_smp *smp) +int smi_check_forward_dr_smp(struct ib_smp *smp) { u8 hop_ptr, hop_cnt; @@ -274,31 +237,6 @@ return 0; } -/* - * Return 1 if the received SMP should be forwarded to the send queue. - * Return 0 if the SMP should be completed up the stack. 
- */ -static int smi_check_forward_smp(struct ib_smp *smp) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_check_forward_dr_smp(smp); - default: /* LR SM class */ - return 1; - } -} - -static int mad_process_local(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad *mad_response, - u16 slid) -{ - return mad_agent->device->process_mad(mad_agent->device, 0, - mad_agent->port_num, - slid, mad, mad_response); -} - static inline struct ib_agent_port_private * __ib_get_agent_mad(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -339,12 +277,28 @@ return entry; } -int agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) { struct ib_agent_port_private *port_priv; + + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_mad *mad, + struct ib_grh *grh, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; struct ib_sge gather_list; struct ib_send_wr send_wr; @@ -445,114 +399,41 @@ return ret; } -int smi_send_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid, - int phys_port_cnt) +int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) { - struct ib_mad *smp_response; - int ret; + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + struct ib_mad_recv_wc mad_recv_wc; - if (!smi_handle_smp_send(smp, mad_agent->device->node_type, - mad_agent->port_num)) { - /* SMI failed send */ - return 0; - } - - if (smi_check_local_smp(mad_agent, smp)) { - smp_response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!smp_response) - return 0; - - ret = mad_process_local(mad_agent, (struct ib_mad *)smp, - smp_response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (!smi_handle_smp_recv((struct ib_smp *)smp_response, - mad_agent->device->node_type, - mad_agent->port_num, - phys_port_cnt)) { - /* SMI failed receive */ - kfree(smp_response); - return 0; - } - if (agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc)) - kfree(smp_response); - } else - kfree(smp_response); + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); return 1; } - /* Post the send on the QP */ - return 1; -} - -int agent_mad_response(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid) -{ - struct ib_mad *response; - struct ib_grh *grh; - int ret; - - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!response) - return 0; - - ret = mad_process_local(mad_agent, mad, response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - grh = (void *)mad - sizeof(struct ib_grh); - agent_mad_send(mad_agent, response, grh, mad_recv_wc); - } else - kfree(response); - return 1; -} - -int agent_recv_mad(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - int phys_port_cnt) -{ - int port_num; - - /* SM Directed 
Route or LID Routed class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE || - mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) { - if (mad_agent->device->node_type != IB_NODE_SWITCH) - port_num = mad_agent->port_num; - else - port_num = mad_recv_wc->wc->port_num; - if (!smi_handle_smp_recv((struct ib_smp *)mad, - mad_agent->device->node_type, - port_num, phys_port_cnt)) { - /* SMI failed receive */ - return 0; - } - - if (smi_check_forward_smp((struct ib_smp *)mad)) { - smi_send_smp(mad_agent, - (struct ib_smp *)mad, - mad_recv_wc, - mad_recv_wc->wc->slid, - phys_port_cnt); - return 0; - } - - } else { - /* PerfMgmt class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - agent_mad_response(mad_agent, mad, mad_recv_wc, - mad_recv_wc->wc->slid); - } else { - printk(KERN_ERR "agent_recv_mad: Unexpected mgmt class 0x%x received\n", mad->mad_hdr.mgmt_class); - } - return 0; + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; } - /* Complete receive up stack */ - return 1; + /* Other fields don't matter so should change signature to just use wc */ + mad_recv_wc.wc = wc; + return agent_mad_send(mad_agent, mad, grh, &mad_recv_wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent, @@ -603,26 +484,6 @@ kfree(agent_send_wr->mad); } -static void agent_recv_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_recv_wc *mad_recv_wc) -{ - struct ib_agent_port_private *port_priv; - - /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_recv_handler: no matching MAD agent %p\n", - mad_agent); - } else { - agent_recv_mad(mad_agent, - mad_recv_wc->recv_buf->mad, - mad_recv_wc, port_priv->phys_port_cnt); - } - - /* Free received MAD */ - ib_free_recv_mad(mad_recv_wc); -} - int ib_agent_port_open(struct ib_device *device, int port_num, int phys_port_cnt) { @@ -663,19 +524,12 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; reg_req.mgmt_class_version = 1; - /* SMA needs to receive Get, Set, and TrapRepress methods */ - bitmap_zero((unsigned long *)®_req.method_mask, IB_MGMT_MAX_METHODS); - set_bit(IB_MGMT_METHOD_GET, (unsigned long *)®_req.method_mask); - set_bit(IB_MGMT_METHOD_SET, (unsigned long *)®_req.method_mask); - set_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); + if (IS_ERR(port_priv->dr_smp_agent)) { ret = PTR_ERR(port_priv->dr_smp_agent); goto error2; @@ -685,10 +539,9 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->lr_smp_agent)) { ret = PTR_ERR(port_priv->lr_smp_agent); goto error3; @@ -696,14 +549,11 @@ /* Obtain MAD agent for PerfMgmt class */ reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; - clear_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + 
port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->perf_mgmt_agent)) { ret = PTR_ERR(port_priv->perf_mgmt_agent); goto error4; Index: openib-candidate/src/linux-kernel/infiniband/access/mad.c =================================================================== --- openib-candidate/src/linux-kernel/infiniband/access/mad.c (revision 1124) +++ openib-candidate/src/linux-kernel/infiniband/access/mad.c (working copy) @@ -781,21 +781,13 @@ goto out; } version = port_priv->version[mad->mad_hdr.class_version]; - if (!version) { - printk(KERN_ERR PFX "MAD received on port %d for class " - "version %d with no client\n", - port_priv->port_num, mad->mad_hdr.class_version); + if (!version) goto out; - } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; if (class) mad_agent = class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; - else - printk(KERN_ERR PFX "MAD received on port %d for class " - "%d with no client\n", - port_priv->port_num, mad->mad_hdr.mgmt_class); } out: @@ -808,9 +800,7 @@ "%p on port %d\n", &mad_agent->agent, port_priv->port_num); } - } else - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); + } spin_unlock_irqrestore(&port_priv->reg_lock, flags); @@ -934,6 +924,23 @@ } } +extern int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); +extern int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -942,6 +949,7 @@ struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; + struct ib_smp *smp; int solicited; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; @@ -968,14 +976,69 @@ if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; - /* Snoop MAD ? */ - if (port_priv->device->snoop_mad) - if (port_priv->device->snoop_mad(port_priv->device, - (u8)port_priv->port_num, - wc->slid, - recv->header.recv_buf.mad)) + if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + smp = (struct ib_smp *)recv->header.recv_buf.mad; + if (!smi_handle_dr_smp_recv(smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) goto out; + if (!smi_check_forward_dr_smp(smp)) + goto out; + if (!smi_handle_dr_smp_send(smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(smp, + port_priv->device, + port_priv->port_num)) + goto out; + } + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + struct ib_mad *response; + struct ib_grh *grh; + int ret; + + response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* Is it better to assume that it wouldn't be processed ? 
*/ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + recv->header.recv_buf.mad, + response); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (response->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) { + kfree(response); + goto out; + } + } + /* Send response */ + grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); + if (agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) { + kfree(response); + goto out; + } + } else + kfree(response); + } + /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, @@ -1673,7 +1736,9 @@ * Open the port * Create the QP, PD, MR, and CQ if needed */ -static int ib_mad_port_open(struct ib_device *device, int port_num) +static int ib_mad_port_open(struct ib_device *device, + int port_num, + int num_ports) { int ret, cq_size; u64 iova = 0; @@ -1702,6 +1767,7 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; + port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1836,7 +1902,7 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port); + ret = ib_mad_port_open(device, cur_port, num_ports); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); Index: openib-candidate/src/linux-kernel/infiniband/access/mad_priv.h =================================================================== --- openib-candidate/src/linux-kernel/infiniband/access/mad_priv.h (revision 1119) +++ openib-candidate/src/linux-kernel/infiniband/access/mad_priv.h (working copy) @@ -156,6 +156,7 @@ struct list_head port_list; struct ib_device *device; int port_num; + int phys_port_cnt; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; Index: openib-candidate/src/linux-kernel/infiniband/include/ib_verbs.h =================================================================== --- openib-candidate/src/linux-kernel/infiniband/include/ib_verbs.h (revision 1108) +++ openib-candidate/src/linux-kernel/infiniband/include/ib_verbs.h (working copy) @@ -640,14 +640,10 @@ enum ib_mad_result { IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ - IB_MAD_RESULT_REPLY = 1 << 1 /* Reply packet needs to be sent */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ }; -enum ib_snoop_mad_result { - IB_SNOOP_MAD_IGNORED, - IB_SNOOP_MAD_CONSUMED -}; - #define IB_DEVICE_NAME_MAX 64 struct ib_device { Index: roland-merge/src/linux-kernel/infiniband/include/ib_verbs.h =================================================================== --- roland-merge/src/linux-kernel/infiniband/include/ib_verbs.h (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/include/ib_verbs.h (working copy) @@ -656,14 +656,10 @@ enum ib_mad_result { IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ - IB_MAD_RESULT_REPLY = 1 << 1 /* Reply packet needs to be sent */ + 
IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ }; -enum ib_snoop_mad_result { - IB_SNOOP_MAD_IGNORED, - IB_SNOOP_MAD_CONSUMED -}; - #define IB_DEVICE_NAME_MAX 64 struct ib_device { Index: roland-merge/src/linux-kernel/infiniband/core/agent.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/core/agent.c (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/core/agent.c (working copy) @@ -29,7 +29,6 @@ #include - static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); @@ -37,9 +36,9 @@ * Fixup a directed route SMP for sending. Return 0 if the SMP should be * discarded. */ -static int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) { u8 hop_ptr, hop_cnt; @@ -111,23 +110,6 @@ } /* - * Sender side handling of outgoing SMPs. Fixup the SMP as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_send(smp, node_type, port_num); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the SMP should be handled by the local SMA via process_mad. */ static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, @@ -145,10 +127,10 @@ * Adjust information for a received SMP. Return 0 if the SMP should be * dropped. */ -static int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) { u8 hop_ptr, hop_cnt; @@ -221,29 +203,10 @@ } /* - * Receive side handling SMPs. Save receive information as required by - * the spec. Return 0 if the SMP should be dropped. - */ -static int smi_handle_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_handle_dr_smp_recv(smp, node_type, - port_num, phys_port_cnt); - default: /* LR SM class */ - return 1; - } -} - -/* * Return 1 if the received DR SMP should be forwarded to the send queue. * Return 0 if the SMP should be completed up the stack. */ -static int smi_check_forward_dr_smp(struct ib_smp *smp) +int smi_check_forward_dr_smp(struct ib_smp *smp) { u8 hop_ptr, hop_cnt; @@ -274,31 +237,6 @@ return 0; } -/* - * Return 1 if the received SMP should be forwarded to the send queue. - * Return 0 if the SMP should be completed up the stack. 
- */ -static int smi_check_forward_smp(struct ib_smp *smp) -{ - switch (smp->mgmt_class) - { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - return smi_check_forward_dr_smp(smp); - default: /* LR SM class */ - return 1; - } -} - -static int mad_process_local(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad *mad_response, - u16 slid) -{ - return mad_agent->device->process_mad(mad_agent->device, 0, - mad_agent->port_num, - slid, mad, mad_response); -} - static inline struct ib_agent_port_private * __ib_get_agent_mad(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -339,30 +277,47 @@ return entry; } -void agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) { struct ib_agent_port_private *port_priv; + + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_mad *mad, + struct ib_grh *grh, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; struct ib_sge gather_list; struct ib_send_wr send_wr; struct ib_send_wr *bad_send_wr; struct ib_ah_attr ah_attr; unsigned long flags; + int ret = 1; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", mad_agent); - return; + goto out; } agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); if (!agent_send_wr) - return; + goto out; agent_send_wr->mad = mad; /* PCI mapping */ @@ -406,8 +361,8 @@ agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); if (IS_ERR(agent_send_wr->ah)) { printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(mad); - return; + kfree(agent_send_wr); + goto out; } send_wr.wr.ud.ah = agent_send_wr->ah; @@ -432,120 +387,53 @@ sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); } else { list_add_tail(&agent_send_wr->send_list, &port_priv->send_posted_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; } + +out: + return ret; } -int smi_send_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid, - int phys_port_cnt) +int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) { - struct ib_mad *smp_response; - int ret; + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + struct ib_mad_recv_wc mad_recv_wc; - if (!smi_handle_smp_send(smp, mad_agent->device->node_type, - mad_agent->port_num)) { - /* SMI failed send */ - return 0; - } - - if (smi_check_local_smp(mad_agent, smp)) { - smp_response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!smp_response) - return 0; - - ret = mad_process_local(mad_agent, (struct ib_mad *)smp, - smp_response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (!smi_handle_smp_recv((struct ib_smp *)smp_response, - mad_agent->device->node_type, - mad_agent->port_num, - phys_port_cnt)) { - /* SMI failed receive */ - kfree(smp_response); - return 0; - } - agent_mad_send(mad_agent, smp_response, - NULL, mad_recv_wc); - } 
else - kfree(smp_response); + port_priv = ib_get_agent_mad(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); return 1; } - /* Post the send on the QP */ - return 1; -} - -int agent_mad_response(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - u16 slid) -{ - struct ib_mad *response; - struct ib_grh *grh; - int ret; - - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); - if (!response) - return 0; - - ret = mad_process_local(mad_agent, mad, response, slid); - if (ret & IB_MAD_RESULT_SUCCESS) { - grh = (void *)mad - sizeof(struct ib_grh); - agent_mad_send(mad_agent, response, grh, mad_recv_wc); - } else - kfree(response); - return 1; -} - -int agent_recv_mad(struct ib_mad_agent *mad_agent, - struct ib_mad *mad, - struct ib_mad_recv_wc *mad_recv_wc, - int phys_port_cnt) -{ - int port_num; - - /* SM Directed Route or LID Routed class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE || - mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) { - if (mad_agent->device->node_type != IB_NODE_SWITCH) - port_num = mad_agent->port_num; - else - port_num = mad_recv_wc->wc->port_num; - if (!smi_handle_smp_recv((struct ib_smp *)mad, - mad_agent->device->node_type, - port_num, phys_port_cnt)) { - /* SMI failed receive */ - return 0; - } - - if (smi_check_forward_smp((struct ib_smp *)mad)) { - smi_send_smp(mad_agent, - (struct ib_smp *)mad, - mad_recv_wc, - mad_recv_wc->wc->slid, - phys_port_cnt); - return 0; - } - - } else { - /* PerfMgmt class */ - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - agent_mad_response(mad_agent, mad, mad_recv_wc, - mad_recv_wc->wc->slid); - } else { - printk(KERN_ERR "agent_recv_mad: Unexpected mgmt class 0x%x received\n", mad->mad_hdr.mgmt_class); - } - return 0; + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; } - /* Complete receive up stack */ - return 1; + /* Other fields don't matter so should change signature to just use wc */ + mad_recv_wc.wc = wc; + return agent_mad_send(mad_agent, mad, grh, &mad_recv_wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent, @@ -596,26 +484,6 @@ kfree(agent_send_wr->mad); } -static void agent_recv_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_recv_wc *mad_recv_wc) -{ - struct ib_agent_port_private *port_priv; - - /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_recv_handler: no matching MAD agent %p\n", - mad_agent); - } else { - agent_recv_mad(mad_agent, - mad_recv_wc->recv_buf->mad, - mad_recv_wc, port_priv->phys_port_cnt); - } - - /* Free received MAD */ - ib_free_recv_mad(mad_recv_wc); -} - int ib_agent_port_open(struct ib_device *device, int port_num, int phys_port_cnt) { @@ -656,19 +524,12 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; reg_req.mgmt_class_version = 1; - /* SMA needs to receive Get, Set, and TrapRepress methods */ - bitmap_zero((unsigned long *)®_req.method_mask, IB_MGMT_MAX_METHODS); - set_bit(IB_MGMT_METHOD_GET, (unsigned long *)®_req.method_mask); - set_bit(IB_MGMT_METHOD_SET, (unsigned long *)®_req.method_mask); - 
set_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); + if (IS_ERR(port_priv->dr_smp_agent)) { ret = PTR_ERR(port_priv->dr_smp_agent); goto error2; @@ -678,10 +539,9 @@ reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, IB_QPT_SMI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->lr_smp_agent)) { ret = PTR_ERR(port_priv->lr_smp_agent); goto error3; @@ -689,14 +549,11 @@ /* Obtain MAD agent for PerfMgmt class */ reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; - clear_bit(IB_MGMT_METHOD_TRAP_REPRESS, - (unsigned long *)®_req.method_mask); - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, - ®_req, 0, + NULL, 0, &agent_send_handler, - &agent_recv_handler, - NULL); + NULL, NULL); if (IS_ERR(port_priv->perf_mgmt_agent)) { ret = PTR_ERR(port_priv->perf_mgmt_agent); goto error4; Index: roland-merge/src/linux-kernel/infiniband/core/mad.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/core/mad.c (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/core/mad.c (working copy) @@ -747,13 +747,16 @@ struct ib_mad *mad, int solicited) { - struct ib_mad_agent_private *entry, *mad_agent = NULL; - struct ib_mad_mgmt_class_table *version; - struct ib_mad_mgmt_method_table *class; - u32 hi_tid; + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + spin_lock_irqsave(&port_priv->reg_lock, flags); + /* Whether MAD was solicited determines type of routing to MAD client */ if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + /* Routing is based on high 32 bits of transaction ID of MAD */ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; list_for_each_entry(entry, &port_priv->agent_list, agent_list) { @@ -762,12 +765,14 @@ break; } } - if (!mad_agent) { + if (!mad_agent) printk(KERN_ERR PFX "No client 0x%x for received MAD " - "on port %d\n", hi_tid, port_priv->port_num); - goto out; - } + "on port %d\n", + hi_tid, port_priv->port_num); } else { + struct ib_mad_mgmt_class_table *version; + struct ib_mad_mgmt_method_table *class; + /* Routing is based on version, class, and method */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { printk(KERN_ERR PFX "MAD received with unsupported " @@ -776,32 +781,29 @@ goto out; } version = port_priv->version[mad->mad_hdr.class_version]; - if (!version) { - printk(KERN_ERR PFX "MAD received on port %d for class " - "version %d with no client\n", - port_priv->port_num, mad->mad_hdr.class_version); + if (!version) goto out; - } class = version->method_table[convert_mgmt_class( mad->mad_hdr.mgmt_class)]; - if (!class) { - printk(KERN_ERR PFX "MAD received on port %d for class " - "%d with no client\n", - port_priv->port_num, mad->mad_hdr.mgmt_class); - goto out; - } - mad_agent = class->agent[mad->mad_hdr.method & - ~IB_MGMT_METHOD_RESP]; + if (class) + mad_agent = class->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; } out: - if (mad_agent && !mad_agent->agent.recv_handler) { - printk(KERN_ERR PFX "No receive handler for client " - "%p on port %d\n", - &mad_agent->agent, port_priv->port_num); - mad_agent = NULL; + if (mad_agent) { + if 
(mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + mad_agent = NULL; + printk(KERN_ERR PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + } } + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + return mad_agent; } @@ -922,6 +924,23 @@ } } +extern int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); +extern int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -929,9 +948,9 @@ struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; - struct ib_mad_agent_private *mad_agent = NULL; + struct ib_mad_agent_private *mad_agent; + struct ib_smp *smp; int solicited; - unsigned long flags; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; @@ -957,31 +976,80 @@ if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; - /* Snoop MAD ? */ - if (port_priv->device->snoop_mad) - if (port_priv->device->snoop_mad(port_priv->device, - (u8)port_priv->port_num, - wc->slid, - recv->header.recv_buf.mad)) + if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + smp = (struct ib_smp *)recv->header.recv_buf.mad; + if (!smi_handle_dr_smp_recv(smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) goto out; + if (!smi_check_forward_dr_smp(smp)) + goto out; + if (!smi_handle_dr_smp_send(smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(smp, + port_priv->device, + port_priv->port_num)) + goto out; + } - spin_lock_irqsave(&port_priv->reg_lock, flags); + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + struct ib_mad *response; + struct ib_grh *grh; + int ret; + + response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* Is it better to assume that it wouldn't be processed ? 
*/ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + recv->header.recv_buf.mad, + response); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (response->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->phys_port_cnt)) { + kfree(response); + goto out; + } + } + /* Send response */ + grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); + if (agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) { + kfree(response); + goto out; + } + } else + kfree(response); + } + /* Determine corresponding MAD agent for incoming receive MAD */ solicited = solicited_mad(recv->header.recv_buf.mad); mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, solicited); - if (!mad_agent) { - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_NOTICE PFX "No matching mad agent found for " - "received MAD on port %d\n", port_priv->port_num); - } else { - atomic_inc(&mad_agent->refcount); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); + if (mad_agent) { ib_mad_complete_recv(mad_agent, recv, solicited); + recv = NULL; /* recv is freed up via ib_mad_complete_recv */ } out: - if (!mad_agent) { + if (recv) { /* Should this case be optimized ? */ kmem_cache_free(ib_mad_cache, recv); } @@ -1668,7 +1736,9 @@ * Open the port * Create the QP, PD, MR, and CQ if needed */ -static int ib_mad_port_open(struct ib_device *device, int port_num) +static int ib_mad_port_open(struct ib_device *device, + int port_num, + int num_ports) { int ret, cq_size; u64 iova = 0; @@ -1697,6 +1767,7 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; + port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1831,7 +1902,7 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port); + ret = ib_mad_port_open(device, cur_port, num_ports); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); Index: roland-merge/src/linux-kernel/infiniband/core/mad_priv.h =================================================================== --- roland-merge/src/linux-kernel/infiniband/core/mad_priv.h (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/core/mad_priv.h (working copy) @@ -156,6 +156,7 @@ struct list_head port_list; struct ib_device *device; int port_num; + int phys_port_cnt; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; Index: roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -349,10 +349,6 @@ u16 slid, struct ib_mad *in_mad, struct ib_mad *out_mad); -enum ib_snoop_mad_result mthca_snoop_mad(struct ib_device *ibdev, - u8 port_num, - u16 slid, - struct ib_mad *mad); static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) { Index: roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_provider.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 1125) +++ 
roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -600,7 +600,6 @@ dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; - dev->ib_dev.snoop_mad = mthca_snoop_mad; ret = ib_register_device(&dev->ib_dev); if (ret) Index: roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c =================================================================== --- roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c (revision 1125) +++ roland-merge/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c (working copy) @@ -79,6 +79,16 @@ int err; u8 status; + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED && + in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + + /* XXX: forward locally generated MAD to SM */ + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + /* * Only handle SM gets, sets and trap represses for SM class * @@ -137,21 +147,6 @@ return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; } -enum ib_snoop_mad_result mthca_snoop_mad(struct ib_device *ibdev, - u8 port_num, - u16 slid, - struct ib_mad *mad) -{ - if (mad->mad_hdr.method != IB_MGMT_METHOD_TRAP || - mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE || - slid != 0) - return IB_SNOOP_MAD_IGNORED; - - /* XXX: forward locally generated MAD to SM */ - - return IB_SNOOP_MAD_CONSUMED; -} - /* * Local Variables: * c-file-style: "linux" From halr at voltaire.com Wed Nov 3 13:59:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 16:59:21 -0500 Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: References: Message-ID: <1099519161.2837.8.camel@hpc-1> On Wed, 2004-11-03 at 13:45, Krishna Kumar wrote: > Entire openib cleaned up to remove 8 spaces to replace with > tabs, just two files though :-) Any chance I could get you to regenerate this patch with the latest code ? I just made a major change to both mad.c and agent.c so this doesn't apply too easily and I'm not sure I could manually fix it right now. Thanks in advance. -- Hal From krkumar at us.ibm.com Wed Nov 3 13:49:17 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 13:49:17 -0800 (PST) Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: <1099518873.2837.5.camel@hpc-1> Message-ID: Hal, Sure, I will regenerate this patch and send in about an hour's time. - KK From halr at voltaire.com Wed Nov 3 14:02:09 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 17:02:09 -0500 Subject: [openib-general] [PATCH] fix memory leak and return value associated with agent_mad_send(response) In-Reply-To: <200411031056.29522.mashirle@us.ibm.com> References: <200411031056.29522.mashirle@us.ibm.com> Message-ID: <1099519329.2837.12.camel@hpc-1> On Wed, 2004-11-03 at 13:56, Shirley Ma wrote: > Here is the patch. Please review it. As I just made a major change to both mad.c and agent.c which changed how this works. Could I prevail on you to review the latest and provide a patch to that ? Thanks. 
-- Hal From mshefty at ichips.intel.com Wed Nov 3 14:33:58 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 14:33:58 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <41895CD6.4020807@ichips.intel.com> Krishna Kumar wrote: > Hi, > > I am not sure if this is a good idea, but since I am new to this area, > here it goes :-) I think that the idea is valid. :) > Section 11.2.6.3, C11-16 states that resize of qp must be permitted. > In the patch I am submitting, I don't understand why so many parameters > are expected by driver/verbs. I thought the qp_handle and ib_qp_attr is > enough, at least according to the spec. I didn't follow what you were trying to reference here. Are you referring to the QP or CQ? > Along with this, I am going to submit another patch to catch "catastrophic" > errors in return value of the resize operation. This is due to the need > to check for 2 special cases : "CQ overrun" and "CQ inaccessible". For > these two errors, I think the queues should be deallocated and error > returned. This is in the second patch. I am not sure of the error numbers, > I guessed it from mthca_eq.c and could be wrong here. I'm adding in code to handle QP errors and overrun. If we are unable to resize the CQ, we can prevent CQ overrun by limiting the number of work requests posted to the corresponding QPs, rather than completely disabling the port. I'll have a better idea of what we can do in this case when I get more of the code in place. From mshefty at ichips.intel.com Wed Nov 3 14:39:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 14:39:03 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <524qk6r9k1.fsf@topspin.com> References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> Message-ID: <41895E07.6080804@ichips.intel.com> Roland Dreier wrote: > Do the names /dev/infiniband/mthca0/umad1 and so on make sense to > people? I thought that userspace verbs support would probably use a > file like /dev/infiniband/mthca0/verbs, etc. I think that this approach is good. - Sean From johannes at erdfelt.com Wed Nov 3 14:43:37 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Wed, 3 Nov 2004 14:43:37 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <524qk6r9k1.fsf@topspin.com> References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> Message-ID: <20041103224337.GS17669@sventech.com> On Wed, Nov 03, 2004, Roland Dreier wrote: > By the way, buried down at the end of the patch is some documentation > about creating device files: > > +/dev files > + > + To create the appropriate character device files automatically with > + udev, a rule like > + > + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" > + > + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 > + for port 1 of device mthca0. > > Do the names /dev/infiniband/mthca0/umad1 and so on make sense to > people? I thought that userspace verbs support would probably use a > file like /dev/infiniband/mthca0/verbs, etc. > > In any case, now is probably the time to object before we have legacy > issues to worry about.... Does the device name need to have the HCA driver name in it? Also, the u in umad is implied.
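For example, dropping the implied "u" would only mean changing the NAME half of the udev rule quoted above (a hypothetical variant; the kernel device itself would still be matched as umad*):

    KERNEL="umad*", NAME="infiniband/%s{ibdev}/mad%s{port}"

which would create nodes like /dev/infiniband/mthca0/mad1 for port 1 of mthca0.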
Wouldn't it be more appropriate to do something like this: /dev/infiniband/hca0/mad1 or maybe even: /dev/ib/hca0/mad1 JE From xma at us.ibm.com Wed Nov 3 15:00:13 2004 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 3 Nov 2004 15:00:13 -0800 Subject: [openib-general] [PATCH] fix memory leak and return value associated with agent_mad_send(response) In-Reply-To: <1099519329.2837.12.camel@hpc-1> Message-ID: > As I just made a major change to both mad.c and agent.c which changed how this works. Could I prevail on you to review the latest and provide a patch to that ? I will take a look at the most recent bit. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From krkumar at us.ibm.com Wed Nov 3 15:52:16 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 15:52:16 -0800 (PST) Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: <1099519161.2837.8.camel@hpc-1> Message-ID: Hi Hal, The same patch on latest bits .... thx, - KK diff -ruNp 1/agent.c 2/agent.c --- 1/agent.c 2004-11-03 15:50:04.000000000 -0800 +++ 2/agent.c 2004-11-03 15:50:47.000000000 -0800 @@ -189,7 +189,7 @@ int smi_handle_dr_smp_recv(struct ib_smp if (hop_ptr == 1) { if (smp->dr_slid == IB_LID_PERMISSIVE) { /* giving SMP to SM - update hop_ptr */ - smp->hop_ptr--; + smp->hop_ptr--; return 1; } /* smp->hop_ptr updated when sending */ @@ -327,7 +327,7 @@ static int agent_mad_send(struct ib_mad_ PCI_DMA_TODEVICE); gather_list.length = sizeof(struct ib_mad); gather_list.lkey = (*port_priv->mr).lkey; - + send_wr.next = NULL; send_wr.opcode = IB_WR_SEND; send_wr.sg_list = &gather_list; @@ -335,7 +335,7 @@ static int agent_mad_send(struct ib_mad_ send_wr.wr.ud.remote_qpn = mad_recv_wc->wc->src_qp; /* DQPN */ send_wr.wr.ud.timeout_ms = 0; send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - + ah_attr.dlid = mad_recv_wc->wc->slid; ah_attr.port_num = mad_agent->port_num; ah_attr.src_path_bits = mad_recv_wc->wc->dlid_path_bits; @@ -364,7 +364,7 @@ static int agent_mad_send(struct ib_mad_ kfree(agent_send_wr); goto out; } - + send_wr.wr.ud.ah = agent_send_wr->ah; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = mad_recv_wc->wc->pkey_index; @@ -441,8 +441,8 @@ static void agent_send_handler(struct ib { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; - struct list_head *send_wr; - unsigned long flags; + struct list_head *send_wr; + unsigned long flags; /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); @@ -460,7 +460,7 @@ static void agent_send_handler(struct ib "is empty\n", (unsigned long long) mad_send_wc->wr_id); return; } - + agent_send_wr = list_entry(&port_priv->send_posted_list, struct ib_agent_send_wr, send_list); @@ -469,8 +469,8 @@ static void agent_send_handler(struct ib send_list); /* Remove from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, @@ -547,8 +547,8 @@ int ib_agent_port_open(struct ib_device goto error3; } - /* Obtain MAD agent for PerfMgmt class */ - reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; 
port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, NULL, 0, @@ -606,7 +606,7 @@ int ib_agent_port_close(struct ib_device ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->lr_smp_agent); ib_unregister_mad_agent(port_priv->dr_smp_agent); - kfree(port_priv); + kfree(port_priv); return 0; } diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 15:50:04.000000000 -0800 +++ 2/mad.c 2004-11-03 15:50:50.000000000 -0800 @@ -1536,7 +1536,7 @@ static inline int ib_mad_change_qp_state struct ib_qp_attr *attr; int attr_mask; - attr = kmalloc(sizeof *attr, GFP_KERNEL); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); return -ENOMEM; On Wed, 3 Nov 2004, Hal Rosenstock wrote: > On Wed, 2004-11-03 at 13:45, Krishna Kumar wrote: > > Entire openib cleaned up to remove 8 spaces to replace with > > tabs, just two files though :-) > > Any chance I could get you to regenerate this patch with the latest code > ? I just made a major change to both mad.c and agent.c so this doesn't > apply too easily and I'm not sure I could manually fix it right now. > > Thanks in advance. > > -- Hal > > > From mshefty at ichips.intel.com Wed Nov 3 16:13:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 16:13:27 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <41897427.4060604@ichips.intel.com> Krishna Kumar wrote: > qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); > - if (IS_ERR(qp_info->qp)) { > - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", > - get_spl_qp_index(qp_type)); > + if (!IS_ERR(qp_info->qp)) { > + struct ib_qp_attr qp_attr; > + > + ret = ib_query_qp(qp_info->qp, &qp_attr, 0, &qp_init_attr); Note that the qp_init_attr parameter passed into ib_create_qp should return the actual size of the QP that was created. The call to ib_query_qp shouldn't be needed. - Sean From krkumar at us.ibm.com Wed Nov 3 16:27:59 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 16:27:59 -0800 (PST) Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <41895CD6.4020807@ichips.intel.com> Message-ID: On Wed, 3 Nov 2004, Sean Hefty wrote: > I didn't follow what you were trying to reference here. Are you > referring to the QP or CQ? QP. When I do a query for the QP, all I really need is the qp ptr and the qp_attr structure to fill in values. What I didn't figure out is why an attr_mask and ib_qp_init_attr is needed. BTW, I had thought that ib_qp_init_attr was used for initialization type of attributes, exactly once the device is passed init attributes, then onwards ib_qp_attr should be used. So ib_qp_init_attr seems redundant. Or I have understood the code wrong. > I'm adding in code to handle QP errors and overrun. If we are unable to > resize the CQ, we can prevent CQ overrun by limited the number of work > requests posted to the corresponding QPs, rather than completely Actually I read it wrong in this case, probably the code needs to check only for "inaccessible" which is a critical error since the CEQ cannot be posted to the CQ even though the CQ is not full. If you are not already adding the exact same functionality, please let me know if the following looks correct. I recreated both patches after Hal's checkin (Patch1 and Patch2 below). 
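For comparison, the no-query version Sean describes would reduce the relevant part of create_mad_qp() to roughly this (a sketch only, not tested; it assumes the driver writes the actual sizes back into qp_init_attr.cap on return, which is what the verbs definition of create QP expects, and it elides the rest of the existing setup):

	struct ib_qp_init_attr qp_init_attr;
	int qp_size;

	memset(&qp_init_attr, 0, sizeof qp_init_attr);
	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
	/* ... qp_type, port_num, CQs, etc. as create_mad_qp() already sets ... */

	qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr);
	if (IS_ERR(qp_info->qp))
		return PTR_ERR(qp_info->qp);

	/*
	 * If the driver updates qp_init_attr.cap in place on return, the
	 * actual queue sizes can be read back directly, with no
	 * ib_query_qp() round trip.
	 */
	qp_size = qp_init_attr.cap.max_send_wr + qp_init_attr.cap.max_recv_wr;
	return qp_size;

PATCH1 below keeps the ib_query_qp() call instead, falling back to the create-time sizes on error.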
Also, I saw your other mail, and I had looked at the driver, and it didn't modify the final size of the new QP in the init_attr. It used the structure to do its work but didn't update it. I was initially planning on not using query() and instead relying on this structure getting updated. The verb interface cannot do it since the qp doesn't contain the size. We cannot change the driver to change the init structure since potentially other drivers may not do it, hence the reason to do a query to figure out the correct size. verb create_qp(): if (!IS_ERR(qp)) { qp->device = pd->device; qp->pd = pd; qp->send_cq = qp_init_attr->send_cq; qp->recv_cq = qp_init_attr->recv_cq; qp->srq = qp_init_attr->srq; qp->qp_context = qp_init_attr->qp_context; atomic_inc(&pd->usecnt); atomic_inc(&qp_init_attr->send_cq->usecnt); atomic_inc(&qp_init_attr->recv_cq->usecnt); if (qp_init_attr->srq) atomic_inc(&qp_init_attr->srq->usecnt); } driver create_qp(): case IB_QPT_SMI: case IB_QPT_GSI: { qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); if (!qp) return ERR_PTR(-ENOMEM); qp->sq.max = init_attr->cap.max_send_wr; qp->rq.max = init_attr->cap.max_recv_wr; qp->sq.max_gs = init_attr->cap.max_send_sge; qp->rq.max_gs = init_attr->cap.max_recv_sge; qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0:1; err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), to_mcq(init_attr->send_cq), to_mcq(init_attr->recv_cq), init_attr->sq_sig_type, init_attr->rq_sig_type, qp->ibqp.qp_num, init_attr->port_num, to_msqp(qp)); break; } thanks, - KK -------------------------------------------------------------------------- PATCH1 -------------------------------------------------------------------------- diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 16:03:25.000000000 -0800 +++ 2/mad.c 2004-11-03 16:03:43.000000000 -0800 @@ -1692,6 +1692,14 @@ static void init_mad_queue(struct ib_mad INIT_LIST_HEAD(&mad_queue->list); } +/* + * Allocate one mad QP. + * + * If the return indicates success, the value returned is the new size + * of the queue pair that got created. + * + * Return > 0 on success and -(ERRNO) on failure. Zero should never happen. + */ static int create_mad_qp(struct ib_mad_port_private *port_priv, struct ib_mad_qp_info *qp_info, enum ib_qp_type qp_type) @@ -1715,15 +1723,23 @@ qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(qp_info->qp)) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", - get_spl_qp_index(qp_type)); + if (!IS_ERR(qp_info->qp)) { + struct ib_qp_attr qp_attr; + + ret = ib_query_qp(qp_info->qp, &qp_attr, 0, &qp_init_attr); + if (ret < 0) { + /* + * For any error, use the same size we used to + * create the queue.
+ */ + ret = qp_init_attr.cap.max_send_wr + + qp_init_attr.cap.max_recv_wr; + } + } else { ret = PTR_ERR(qp_info->qp); - goto error; + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d err:%d\n", + get_spl_qp_index(qp_type), ret); } - return 0; - -error: return ret; } @@ -1747,6 +1763,7 @@ static int ib_mad_port_open(struct ib_de .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_mad_port_private *port_priv; + int total_qp_size; unsigned long flags; /* First, check if port already open at MAD layer */ @@ -1797,11 +1814,25 @@ static int ib_mad_port_open(struct ib_de } ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); - if (ret) + if (ret <= 0) goto error6; + total_qp_size = ret; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); - if (ret) + if (ret <= 0) goto error7; + total_qp_size += ret; + + /* Resize if the total QP[0,1] size is greater than CQ size. */ + if (total_qp_size > cq_size) { + printk(KERN_DEBUG PFX "ib_mad_port_open: increasing size of " + "CQ from %d to %d\n", cq_size, total_qp_size); + if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { + printk(KERN_DEBUG PFX "Couldn't increase CQ size - " + "err:%d\n", ret); + /* continue, not an error */ + } + } spin_lock_init(&port_priv->reg_lock); INIT_LIST_HEAD(&port_priv->agent_list); ---------------------------------------------------------------------------- PATCH2 ---------------------------------------------------------------------------- diff -ruNp 2/mad.c 3/mad.c --- 2/mad.c 2004-11-03 16:03:43.000000000 -0800 +++ 3/mad.c 2004-11-03 16:17:54.000000000 -0800 @@ -1749,6 +1749,21 @@ static void destroy_mad_qp(struct ib_mad } /* + * Overrun and Inaccessible errors cannot be handled by QP resize operation. + */ +static inline int is_catastrophic_error(int err) +{ +#define CQ_ACCESS_ERROR 0x11 + + switch (err) { + default: /* OK */ + return 0; + case CQ_ACCESS_ERROR: + return 1; + } +} + +/* * Open the port * Create the QP, PD, MR, and CQ if needed */ @@ -1830,6 +1845,10 @@ static int ib_mad_port_open(struct ib_de if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { printk(KERN_DEBUG PFX "Couldn't increase CQ size - " "err:%d\n", ret); + if (is_catastrophic_error(ret)) { + /* Clean up qp_info[0,1] */ + goto error8; + } /* continue, not an error */ } } From mshefty at ichips.intel.com Wed Nov 3 16:54:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Nov 2004 16:54:46 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <41897DD6.8060100@ichips.intel.com> Krishna Kumar wrote: > QP. When I do a query for the QP, all I really need is the qp ptr and > the qp_attr structure to fill in values. What I didn't figure out is > why an attr_mask and ib_qp_init_attr is needed. BTW, I had thought that > ib_qp_init_attr was used for initialization type of attributes, exactly > once the device is passed init attributes, then onwards ib_qp_attr should > be used. So ib_qp_init_attr seems redundant. Or I have understood the > code wrong. The mask allows the query to be a little more selective about what data it is trying to access, which can potentially avoid accessing the hardware. The qp_attr and qp_init_attr contain different data, so both are returned from the query call. To have ib_query_qp return only qp_attr, we would need to add the fields from qp_init_attr to it. > If you are not already adding the exact same functionality, please let me > know if the following looks correct. 
I recreated both patches after Hal's > checkin (Patch1 and Patch2 below). I am not adding this same functionality, and I'm coding around where your patch would go. > Also, I saw your other mail, and I had looked at the driver and it > didn't modify the final size of the new QP in the init_attr. It used the > structure to do it's work but doesn't update it. I was initially planning > on not using query() and instead rely on this structure getting updated. > The verb interface cannot do it since it qp doesn't contain the size. We > cannot change the driver to change the init structure since potentially > other drivers may not do it, so the reason to do a query to figure the > correct size. The original call to ib_create_qp took a third parameter, a qp_cap structure, for output. This structure contained the actual QP settings returned from the ib_create_qp call. I assumed that by removing this parameter, the capabilities would be returned directly in the qp_init_attr structure. If this is not the case, then the driver should probably change to do that. This matches what is defined by verbs, so I think that it's safe to do it. > qp->sq.max = init_attr->cap.max_send_wr; > qp->rq.max = init_attr->cap.max_recv_wr; > qp->sq.max_gs = init_attr->cap.max_send_sge; > qp->rq.max_gs = init_attr->cap.max_recv_sge; > > err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), I haven't looked at the mthca_alloc_sqp call in more detail, but if it doesn't create a QP larger than that specified, then it wouldn't need to change the qp_cap fields. From krkumar at us.ibm.com Wed Nov 3 17:33:36 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 17:33:36 -0800 (PST) Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <41897DD6.8060100@ichips.intel.com> Message-ID: Hi Sean, I just checked the spec for create_qp on pg567 and found that it expects the verb/driver to modify this field, as you indicated. So I will go ahead and submit a new patch to fix this. Thanks for your input, - KK On Wed, 3 Nov 2004, Sean Hefty wrote: > Krishna Kumar wrote: > > QP. When I do a query for the QP, all I really need is the qp ptr and > > the qp_attr structure to fill in values. What I didn't figure out is > > why an attr_mask and ib_qp_init_attr is needed. BTW, I had thought that > > ib_qp_init_attr was used for initialization type of attributes, exactly > > once the device is passed init attributes, then onwards ib_qp_attr should > > be used. So ib_qp_init_attr seems redundant. Or I have understood the > > code wrong. > > The mask allows the query to be a little more selective about what data > it is trying to access, which can potentially avoid accessing the hardware. > > The qp_attr and qp_init_attr contain different data, so both are > returned from the query call. To have ib_query_qp return only qp_attr, > we would need to add the fields from qp_init_attr to it. > > > If you are not already adding the exact same functionality, please let me > > know if the following looks correct. I recreated both patches after Hal's > > checkin (Patch1 and Patch2 below). > > I am not adding this same functionality, and I'm coding around where > your patch would go. > > > Also, I saw your other mail, and I had looked at the driver and it > > didn't modify the final size of the new QP in the init_attr. It used the > > structure to do it's work but doesn't update it. I was initially planning > > on not using query() and instead rely on this structure getting updated. 
> > The verb interface cannot do it since it qp doesn't contain the size. We > > cannot change the driver to change the init structure since potentially > > other drivers may not do it, so the reason to do a query to figure the > > correct size. > > The original call to ib_create_qp took a third parameter, a qp_cap > structure, for output. This structure contained the actual QP settings > returned from the ib_create_qp call. I assumed that by removing this > parameter, the capabilities would be returned directly in the > qp_init_attr structure. If this is not the case, then the driver should > probably change to do that. This matches what is defined by verbs, so > I think that it's safe to do it. > > > qp->sq.max = init_attr->cap.max_send_wr; > > qp->rq.max = init_attr->cap.max_recv_wr; > > qp->sq.max_gs = init_attr->cap.max_send_sge; > > qp->rq.max_gs = init_attr->cap.max_recv_sge; > > > > err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), > > I haven't looked at the mthca_alloc_sqp call in more detail, but if it > doesn't create a QP larger than that specified, then it wouldn't need to > change the qp_cap fields. > > > From krkumar at us.ibm.com Wed Nov 3 17:44:01 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 17:44:01 -0800 (PST) Subject: [openib-general] [PATCH 1/2] Resize CQ Message-ID: This is after incorporating feedback from Sean. Compiles cleanly. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-03 16:03:25.000000000 -0800 +++ 2/mad.c 2004-11-03 17:37:31.000000000 -0800 @@ -1692,6 +1692,14 @@ static void init_mad_queue(struct ib_mad INIT_LIST_HEAD(&mad_queue->list); } +/* + * Allocate one mad QP. + * + * If the return indicates success, the value returned is the new size + * of the queue pair that got created. + * + * Return > 0 on success and -(ERRNO) on failure. Zero should never happen. + */ static int create_mad_qp(struct ib_mad_port_private *port_priv, struct ib_mad_qp_info *qp_info, enum ib_qp_type qp_type) @@ -1715,15 +1723,18 @@ static int create_mad_qp(struct ib_mad_p qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); - if (IS_ERR(qp_info->qp)) { - printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", - get_spl_qp_index(qp_type)); + if (!IS_ERR(qp_info->qp)) { + /* + * Driver should have modified the cap max_* fields + * if it increased the qp send/recv size. + */ + ret = qp_init_attr.cap.max_send_wr + + qp_init_attr.cap.max_recv_wr; + } else { ret = PTR_ERR(qp_info->qp); - goto error; + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d err:%d\n", + get_spl_qp_index(qp_type), ret); } - return 0; - -error: return ret; } @@ -1747,6 +1758,7 @@ static int ib_mad_port_open(struct ib_de .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_mad_port_private *port_priv; + int total_qp_size; unsigned long flags; /* First, check if port already open at MAD layer */ @@ -1797,11 +1809,25 @@ static int ib_mad_port_open(struct ib_de } ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); - if (ret) + if (ret <= 0) goto error6; + total_qp_size = ret; + ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); - if (ret) + if (ret <= 0) goto error7; + total_qp_size += ret; + + /* Resize if the total size of QP[0,1] is greater than CQ size. 
*/ + if (total_qp_size > cq_size) { + printk(KERN_DEBUG PFX "ib_mad_port_open: Increasing size of " + "CQ from %d to %d\n", cq_size, total_qp_size); + if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { + printk(KERN_DEBUG PFX "Couldn't increase CQ size - " + "err:%d\n", ret); + /* continue, not an error */ + } + } spin_lock_init(&port_priv->reg_lock); INIT_LIST_HEAD(&port_priv->agent_list); From krkumar at us.ibm.com Wed Nov 3 17:48:52 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 17:48:52 -0800 (PST) Subject: [openib-general] [PATCH 2/2] Implement error handling in resize failure. Message-ID: The only issue is whether the code below for CQ_ACCESS_ERROR is correct. I have taken it from mthca_eq.c : MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11 thanks, - KK diff -ruNp 2/mad.c 3/mad.c --- 2/mad.c 2004-11-03 17:37:31.000000000 -0800 +++ 3/mad.c 2004-11-03 17:38:40.000000000 -0800 @@ -1744,6 +1744,21 @@ static void destroy_mad_qp(struct ib_mad } /* + * "Inaccessible" error cannot be handled by QP resize operation. + */ +static inline int is_catastrophic_error(int err) +{ +#define CQ_ACCESS_ERROR 0x11 + + switch (err) { + default: /* OK */ + return 0; + case CQ_ACCESS_ERROR: + return 1; + } +} + +/* * Open the port * Create the QP, PD, MR, and CQ if needed */ @@ -1825,6 +1840,10 @@ static int ib_mad_port_open(struct ib_de if ((ret = ib_resize_cq(port_priv->cq, total_qp_size)) < 0) { printk(KERN_DEBUG PFX "Couldn't increase CQ size - " "err:%d\n", ret); + if (is_catastrophic_error(ret)) { + /* Clean up qp[0,1] */ + goto error8; + } /* continue, not an error */ } } From krkumar at us.ibm.com Wed Nov 3 18:23:58 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Wed, 3 Nov 2004 18:23:58 -0800 (PST) Subject: [openib-general] [PATCH] Reorganize and clean up debug messages in find_mad_agent() In-Reply-To: Message-ID: Now messages are printed if either : 1. mad_agent is not found. 2. mad_agent is found but doesn't have a handler. Printing messages during errors in the process of finding the mad_agent has been removed. Thanks, - KK On Wed, 3 Nov 2004, Sean Hefty wrote: > Thanks for the patch. If you can do something with the printk's, that > would be good. They should be KERN_NOTICE, but we may want to consider > just removing them. diff -ruNp 5/mad.c 6/mad.c --- 5/mad.c 2004-11-03 17:56:54.000000000 -0800 +++ 6/mad.c 2004-11-03 18:17:04.000000000 -0800 @@ -752,34 +752,33 @@ find_mad_agent(struct ib_mad_port_privat spin_lock_irqsave(&port_priv->reg_lock, flags); - /* Whether MAD was solicited determines type of routing to MAD client */ + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ if (solicited) { u32 hi_tid; struct ib_mad_agent_private *entry; - /* Routing is based on high 32 bits of transaction ID of MAD */ + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
+ */ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; - list_for_each_entry(entry, &port_priv->agent_list, agent_list) { + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { if (entry->agent.hi_tid == hi_tid) { mad_agent = entry; break; } } - if (!mad_agent) - printk(KERN_ERR PFX "No client 0x%x for received MAD " - "on port %d\n", - hi_tid, port_priv->port_num); } else { struct ib_mad_mgmt_class_table *version; struct ib_mad_mgmt_method_table *class; /* Routing is based on version, class, and method */ - if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { - printk(KERN_ERR PFX "MAD received with unsupported " - "class version %d on port %d\n", - mad->mad_hdr.class_version, port_priv->port_num); + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) goto out; - } version = port_priv->version[mad->mad_hdr.class_version]; if (!version) goto out; @@ -790,18 +789,19 @@ find_mad_agent(struct ib_mad_port_privat ~IB_MGMT_METHOD_RESP]; } -out: if (mad_agent) { if (mad_agent->agent.recv_handler) atomic_inc(&mad_agent->refcount); else { - mad_agent = NULL; - printk(KERN_ERR PFX "No receive handler for client " + printk(KERN_NOTICE PFX "No receive handler for client " "%p on port %d\n", &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; } - } - + } else + printk(KERN_NOTICE PFX "No client for received MAD on " + "port %d\n", port_priv->port_num); +out: spin_unlock_irqrestore(&port_priv->reg_lock, flags); return mad_agent; From roland at topspin.com Wed Nov 3 18:54:09 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 18:54:09 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: (Krishna Kumar's message of "Wed, 3 Nov 2004 13:23:32 -0800 (PST)") References: Message-ID: <52zn1ypbn2.fsf@topspin.com> Not sure what the goal is here, but I should point out that current mthca code does not implement resizing either CQs or QPs. However I'm not sure I understand why the MAD layer wants to resize these objects -- given that the number of QPs is known in advance and that the MAD layer can choose how many work requests to post per QP, I'm not sure what is gained by trying to resize things dynamically. - Roland From roland at topspin.com Wed Nov 3 18:59:06 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 18:59:06 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <20041103224337.GS17669@sventech.com> (Johannes Erdfelt's message of "Wed, 3 Nov 2004 14:43:37 -0800") References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> <20041103224337.GS17669@sventech.com> Message-ID: <52vfcmpbet.fsf@topspin.com> Johannes> Does the device name need to have the HCA driver name in Johannes> it? Also, the u in umad is implied. Good point, I'll change the docs to suggest no "u." Johannes> Wouldn't it be more appropriate to do something like Johannes> this: Johannes> /dev/infiniband/hca0/mad1 Maybe, but: - How does userspace know which device hca0 corresponds to? Right now, mthca0 can be looked up under /sys/class/infiniband. - Do we need to do switch0 etc. for switches? Of course the mthca driver could be updated to register itself using hcaN names instead of mthcaN names, which would solve things fairly transparently. Johannes> or maybe even: Johannes> /dev/ib/hca0/mad1 I don't like /dev/ib/ because I think "ib" is a little generic. So I prefer the more verbose but unambiguous "infiniband" name. 
- Roland From roland at topspin.com Wed Nov 3 20:13:35 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 03 Nov 2004 20:13:35 -0800 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) In-Reply-To: <1099518873.2837.5.camel@hpc-1> (Hal Rosenstock's message of "Wed, 03 Nov 2004 16:54:34 -0500") References: <1099518873.2837.5.camel@hpc-1> Message-ID: <52ekjap7yo.fsf@topspin.com> Can you resend either with a different mailer or as an attachment? The patch was pretty line-wrapped. - R. From halr at voltaire.com Wed Nov 3 20:44:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 03 Nov 2004 23:44:36 -0500 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) In-Reply-To: <52ekjap7yo.fsf@topspin.com> References: <1099518873.2837.5.camel@hpc-1> <52ekjap7yo.fsf@topspin.com> Message-ID: <1099543475.10754.3.camel@hpc-1> On Wed, 2004-11-03 at 23:13, Roland Dreier wrote: > Can you resend either with a different mailer or as an attachment? > The patch was pretty line-wrapped. Sorry. My mailer, Evolution 1.2.2-4, wraps lines in some cases. Not sure if this is fixed in newer versions or whether this is a configuration thing. Anyhow, let's try as an attachment for now. I'm sure you know this, but you will want to skip the changes to openib-candidate as they have already been applied. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: patch-plm Type: text/x-patch Size: 41063 bytes Desc: not available URL: From noohgnas at gmail.com Wed Nov 3 21:52:06 2004 From: noohgnas at gmail.com (Sang-Hoon,Lee) Date: Thu, 4 Nov 2004 14:52:06 +0900 Subject: [openib-general] question for ib_srp Message-ID: Hi all, I have a question about ib_srp usage: I couldn't get the ib_srp module to work. As far as I know, modprobe ib_srp is the common usage, and I ran it like this:
# modprobe ib_srp target_bindings="0002c90108a06551" srp_tracelevel=4 use_srp_indirect_addressing=1 ib_ports_mask=1
Then some errors were shown in /var/log/messages, below:
kernel: ib_srp: module license 'unspecified' taints kernel.
kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:206]max targets reported to scsi 64 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:207]max luns reported to scsi 256 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:209]max cmds(including aborts) per lun 32 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:211]max outstanding ios per target 259 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:226]Found HCA 0 ee743220 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:256]SRP Initiator GUID: 2c901081e67c0 for hca 1 kernel: [SRPTP][srptp_init_module][drivers/infiniband/ulp/srp/srptp.c:309]Pool Create max pages 0x12 pool size 0x4000 kernel: [SRPTP][srp_dm_init][drivers/infiniband/ulp/srp/srp_dm.c:1614]Registering async events handler for HCA 0 kernel: Target Binding 0 to Target 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1131]Refreshing HCA/port info kernel: [SRPTP][srp_register_out_of_service][drivers/infiniband/ulp/srp/srp_dm.c:1333]Registering hca 1 local port 1 for IB out of service traps kernel: [SRPTP][srp_register_in_service][drivers/infiniband/ulp/srp/srp_dm.c:1283]Registering hca 1 local port 1 for IB in service traps kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_out_of_service_completion][drivers/infiniband/ulp/srp/srp_dm.c:1226]Out of service trap for hca 1 port 1 complete kernel: [SRPTP][srp_in_service_completion][drivers/infiniband/ulp/srp/srp_dm.c:1258]In service trap for hca 1 port 1 complete kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 0, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 0 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 2, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 2 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 3, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 3 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 4, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 4 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 5, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue 
back to scsi for target 5 [... the identical pair of messages, "Target N, no connection timeout" followed by "Flushing the pending queue back to scsi for target N", repeats here for each of targets 6 through 62 ...] kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 63, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to
scsi for target 63 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 4 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:584]Restarting DM Query on hca 1 port 1 timeout, retry count 1 kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:584]Restarting DM Query on hca 1 port 1 timeout, retry count 2 kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: 
[SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][srp_host_init][drivers/infiniband/ulp/srp/srp_host.c:1607]0 active connections 0 pending connections kernel: kernel: scsi3 : kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:584]Restarting DM Query on hca 1 port 1 timeout, retry count 3 kernel: [SRPTP][srp_dm_query][drivers/infiniband/ulp/srp/srp_dm.c:1375]DM Query Initiated on hca 1 local port 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 5 times kernel: [SRPTP][srp_host_abort_eh][drivers/infiniband/ulp/srp/srp_host.c:2522]Abort SCpnt ec912200 on target 1 kernel: [SRPTP][srp_host_device_reset_eh][drivers/infiniband/ulp/srp/srp_host.c:2697]Device reset...target 1 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 1 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:767] Sending IO back to scsi from pending list ioq eba4d290 kernel: bad: scheduling while atomic! 
kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] process_timeout+0x0/0x9 kernel: [] wake_up_process+0x1e/0x22 kernel: [] recalc_task_prio+0xb2/0x1ea kernel: [] __down+0x99/0x112 kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed+0x8/0xc kernel: [] .text.lock.scsi_error+0x23/0x46 kernel: [] scsi_eh_done+0x0/0x49 kernel: [] scsi_eh_times_out+0x0/0x1d kernel: [] scsi_eh_tur+0x93/0xc8 kernel: [] scsi_eh_bus_device_reset+0xc8/0xd2 kernel: [] scsi_eh_ready_devs+0x4f/0x93 kernel: [] scsi_unjam_host+0xc8/0xd1 kernel: [] scsi_error_handler+0xd1/0x10a kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 6 times kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 kernel: [SRPTP][srp_host_abort_eh][drivers/infiniband/ulp/srp/srp_host.c:2522]Abort SCpnt ec912200 on target 1 kernel: [SRPTP][srp_host_reset_eh][drivers/infiniband/ulp/srp/srp_host.c:2749]Host reset kernel: bad: scheduling while atomic! kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] __wake_up_common+0x31/0x50 kernel: [] __down+0x99/0x112 kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed+0x8/0xc kernel: [] .text.lock.scsi_error+0x37/0x46 kernel: [] scsi_sleep_done+0x0/0x11 kernel: [] scsi_try_host_reset+0x8f/0xc6 kernel: [] scsi_eh_host_reset+0x48/0xae kernel: [] scsi_eh_ready_devs+0x73/0x93 kernel: [] scsi_unjam_host+0xc8/0xd1 kernel: [] scsi_error_handler+0xd1/0x10a kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1186]Number of active dm_queries 1 last message repeated 3 times kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:600]DM Client timeout on hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:570]DM Client Query complete hca 1 port 1 kernel: [SRPTP][srp_host_dm_completion][drivers/infiniband/ulp/srp/srp_dm.c:594]DM Client timeout on hca 1 port 1, retry count 4 exceeded kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:971]Sweeping all targets, that are in need of a connection kernel: [SRPTP][sweep_targets][drivers/infiniband/ulp/srp/srp_host.c:989]target 1, no active connection kernel: [SRPTP][pick_connection_path][drivers/infiniband/ulp/srp/srp_dm.c:239]Target 1, no paths available kernel: bad: scheduling while atomic! 
kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] __wake_up_locked+0x22/0x26 kernel: [] __down+0x99/0x112 kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed+0x8/0xc kernel: [] .text.lock.scsi_error+0x23/0x46 kernel: [] scsi_eh_done+0x0/0x49 kernel: [] scsi_eh_times_out+0x0/0x1d kernel: [] scsi_eh_tur+0x93/0xc8 kernel: [] scsi_eh_host_reset+0x8d/0xae kernel: [] scsi_eh_ready_devs+0x73/0x93 kernel: [] scsi_unjam_host+0xc8/0xd1 kernel: [] scsi_error_handler+0xd1/0x10a kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: [SRPTP][srp_dm_poll_thread][drivers/infiniband/ulp/srp/srp_host.c:1322]Target 1, no connection timeout kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:744]Flushing the pending queue back to scsi for target 1 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:767] Sending IO back to scsi from pending list ioq eba4d290 kernel: [SRPTP][srp_pending_to_scsi][drivers/infiniband/ulp/srp/srp_host.c:767] Sending IO back to scsi from pending list ioq eba4de10 kernel: scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 1 lun 0 kernel: bad: scheduling while atomic! kernel: Call Trace: kernel: [] schedule+0x5b2/0x5b7 kernel: [] as_next_request+0x33/0x3c kernel: [] elv_next_request+0x10/0xfc kernel: [] __down_interruptible+0xbd/0x14e kernel: [] default_wake_function+0x0/0x12 kernel: [] __down_failed_interruptible+0x7/0xc kernel: [] .text.lock.scsi_error+0x41/0x46 kernel: [] scsi_error_handler+0x0/0x10a kernel: [] kernel_thread_helper+0x5/0xb kernel: kernel: srp_host: target_bindings=2c90108a0655.1
The InfiniBand device drivers were loaded on both the initiator and the target machine. Is the problem within the initiator or the target? This is the list of modules loaded in common on the initiator and the target:
ib_dm_client 24764 0
ib_cm 52312 0
ib_useraccess 12484 0
ib_ipoib 66188 0
ib_sa_client 30216 3 ib_srp,ib_dm_client,ib_ipoib
ib_client_query 15392 4 ib_srp,ib_dm_client,ib_ipoib,ib_sa_client
ib_poll 17080 3 ib_dm_client,ib_cm,ib_client_query
ib_tavor 33284 5
mod_vapi 157688 1 ib_tavor
mod_vipkl 223932 1 mod_vapi
ib_mad 25100 4 ib_cm,ib_useraccess,ib_client_query,ib_tavor
mod_mpga 24576 1 mod_vapi
mod_thh 272160 1 mod_vapi
mod_vapi_common 87808 4 ib_tavor,mod_vapi,mod_vipkl,mod_thh
mosal 126792 5 mod_vapi,mod_vipkl,mod_mpga,mod_thh,mod_vapi_common
mod_hh 16696 2 mod_vipkl,mod_thh
ib_core 247316 8 ib_srp,ib_dm_client,ib_cm,ib_useraccess,ib_ipoib,ib_sa_client,ib_tavor,ib_mad
ib_services 17860 11 ib_srp,ib_dm_client,ib_cm,ib_useraccess,ib_ipoib,ib_sa_client,ib_client_query,ib_poll,ib_tavor,ib_mad,ib_core
Could you tell me what I should do to get ib_srp operating normally? /best regards
From mst at mellanox.co.il Thu Nov 4 02:41:20 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 12:41:20 +0200 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <52zn1ypbn2.fsf@topspin.com> References: <52zn1ypbn2.fsf@topspin.com> Message-ID: <20041104104120.GA2177@mellanox.co.il> If the max. number of QPs is very big, you may want the actual CQ size to grow gradually with demand. Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ": > Not sure what the goal is here, but I should point out that current > mthca code does not implement resizing either CQs or QPs.
> > However I'm not sure I understand why the MAD layer wants to resize > these objects -- given that the number of QPs is known in advance and > that the MAD layer can choose how many work requests to post per QP, > I'm not sure what is gained by trying to resize things dynamically. > > - Roland
From mst at mellanox.co.il Thu Nov 4 02:46:05 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 12:46:05 +0200 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <20041103224337.GS17669@sventech.com> References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> <20041103224337.GS17669@sventech.com> Message-ID: <20041104104605.GB2177@mellanox.co.il> Hello! Quoting r. Johannes Erdfelt (johannes at erdfelt.com) "Re: [openib-general] [PATCH] Initial checkin of userspace MAD access": > On Wed, Nov 03, 2004, Roland Dreier wrote: > > By the way, buried down at the end of the patch is some documentation > > about creating device files: > > > > +/dev files > > + > > + To create the appropriate character device files automatically with > > + udev, a rule like > > + > > + KERNEL="umad*", NAME="infiniband/%s{ibdev}/umad%s{port}" > > + > > + can be used. This will create nodes such as /dev/infiniband/mthca0/umad1 > > + for port 1 of device mthca0. > > > > Do the names /dev/infiniband/mthca0/umad1 and so on make sense to > > people? I thought that userspace verbs support would probably use a > > file like /dev/infiniband/mthca0/verbs, etc. > > > > In any case, now is probably the time to object before we have legacy > > issues to worry about.... > > Does the device name need to have the HCA driver name in it? Also, the u > in umad is implied. > > Wouldn't it be more appropriate to do something like this: > > /dev/infiniband/hca0/mad1 > > or maybe even: > > /dev/ib/hca0/mad1 > > JE I'd suggest /dev/ib/hca0/ports/1/mad. Then the user can give a file name like /dev/ib/hca0/ports/1 to opensm directly, opensm will just append "/mad", and it is also easier to find out how many ports are in hca0 without switching to sysfs. MST
From mst at mellanox.co.il Thu Nov 4 05:03:05 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 15:03:05 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded Message-ID: <20041104130305.GA2735@mellanox.co.il> Hello! I have uploaded an mstflint flash burning package to openib.org. You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ This is an update to the original flint utility that makes it possible to perform flash burning without loading special kernel-level drivers: it performs device PCI memory access by finding the device's physical address in /proc/bus/pci/devices and then accessing that memory through the standard /dev/mem file. There is also support for access through the configuration space, by writes to the special files in /proc/bus/pci. See https://openib.org/svn/trunk/contrib/mellanox/mstflint/README for installation details. Feedback welcome.
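[For illustration: the /dev/mem technique described above amounts to roughly the minimal C sketch below. This is not the actual mstflint code (that lives in mtcr.h and flint.cpp); the BAR physical address and map size here are made-up example values, and in the real tool the address is parsed out of /proc/bus/pci/devices.]

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Example values only; the real physical BAR address would be
         * parsed from /proc/bus/pci/devices for the HCA in question. */
        off_t bar_phys = 0xfe000000;
        size_t map_len = 0x100000;

        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) {
            perror("open /dev/mem");
            return 1;
        }

        /* Map the device's PCI memory BAR into this process. */
        volatile uint32_t *regs = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, bar_phys);
        if (regs == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        /* Device registers can now be read and written like memory. */
        printf("reg[0] = 0x%08x\n", regs[0]);

        munmap((void *) regs, map_len);
        close(fd);
        return 0;
    }

[The appeal of this approach is that no special kernel module is needed; the tradeoff is that it requires root and bypasses all kernel mediation of the device.]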
MST
From halr at voltaire.com Thu Nov 4 06:24:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 09:24:18 -0500 Subject: [openib-general] [PATCH] Cleanup spaces to tabs In-Reply-To: References: Message-ID: <1099578258.15107.1.camel@hpc-1> On Wed, 2004-11-03 at 18:52, Krishna Kumar wrote: > Hi Hal, > > The same patch on latest bits .... Thanks. Applied. -- Hal
From halr at voltaire.com Thu Nov 4 06:51:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 09:51:58 -0500 Subject: [openib-general] [PATCH] Reorganize and clean up debug messages in find_mad_agent() In-Reply-To: References: Message-ID: <1099579918.2943.8.camel@hpc-1> On Wed, 2004-11-03 at 21:23, Krishna Kumar wrote: > Now messages are printed if either : > > 1. mad_agent is not found. > 2. mad_agent is found but doesn't have a handler. > > Printing messages during errors in the process of finding the > mad_agent has been removed. Thanks. Applied. -- Hal
From roland at topspin.com Thu Nov 4 07:08:04 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 07:08:04 -0800 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <20041104104120.GA2177@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 4 Nov 2004 12:41:20 +0200") References: <52zn1ypbn2.fsf@topspin.com> <20041104104120.GA2177@mellanox.co.il> Message-ID: <521xf9ps8b.fsf@topspin.com> Michael> If the max. number of QPs is very big, you may want the Michael> actual CQ size to grow gradually with demand. sure but there are only 2 special qps per port.
From halr at voltaire.com Thu Nov 4 07:15:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 10:15:11 -0500 Subject: [openib-general] [PATCH] agent: Change calling argument to agent_mad_send Message-ID: <1099581311.2943.16.camel@hpc-1> agent: Change calling argument to agent_mad_send Rather than taking a struct ib_mad_recv_wc *, take a struct ib_wc * Index: agent.c =================================================================== --- agent.c (revision 1131) +++ agent.c (working copy) @@ -296,7 +296,7 @@ static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_mad *mad, struct ib_grh *grh, - struct ib_mad_recv_wc *mad_recv_wc) + struct ib_wc *wc) { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; @@ -332,17 +332,17 @@ send_wr.opcode = IB_WR_SEND; send_wr.sg_list = &gather_list; send_wr.num_sge = 1; - send_wr.wr.ud.remote_qpn = mad_recv_wc->wc->src_qp; /* DQPN */ + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ send_wr.wr.ud.timeout_ms = 0; send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - ah_attr.dlid = mad_recv_wc->wc->slid; + ah_attr.dlid = wc->slid; ah_attr.port_num = mad_agent->port_num; - ah_attr.src_path_bits = mad_recv_wc->wc->dlid_path_bits; - ah_attr.sl = mad_recv_wc->wc->sl; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; ah_attr.static_rate = 0; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - if (mad_recv_wc->wc->wc_flags & IB_WC_GRH) { + if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; ah_attr.grh.sgid_index = 0; /* Should sgid be looked up ?
*/ @@ -351,7 +351,7 @@ ah_attr.grh.traffic_class = (be32_to_cpup(&grh->version_tclass_flow) >> 20) & 0xff; memcpy(ah_attr.grh.dgid.raw, grh->sgid.raw, sizeof(struct ib_grh)); } else { - ah_attr.ah_flags = 0; /* No GRH */ + ah_attr.ah_flags = 0; /* No GRH for SM class */ } } else { /* Directed route or LID routed SM class */ @@ -367,7 +367,7 @@ send_wr.wr.ud.ah = agent_send_wr->ah; if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - send_wr.wr.ud.pkey_index = mad_recv_wc->wc->pkey_index; + send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ @@ -407,7 +407,6 @@ { struct ib_agent_port_private *port_priv; struct ib_mad_agent *mad_agent; - struct ib_mad_recv_wc mad_recv_wc; port_priv = ib_get_agent_mad(device, port_num, NULL); if (!port_priv) { @@ -431,9 +430,7 @@ return 1; } - /* Other fields don't matter so should change signature to just use wc */ - mad_recv_wc.wc = wc; - return agent_mad_send(mad_agent, mad, grh, &mad_recv_wc); + return agent_mad_send(mad_agent, mad, grh, wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent,
From halr at voltaire.com Thu Nov 4 07:40:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 10:40:47 -0500 Subject: [openib-general] [PATCH] agent: Minor modifications to smi_check_local_xxx routines Message-ID: <1099582847.2837.2.camel@hpc-1> agent: Minor modifications to smi_check_local_xxx routines Index: agent.c =================================================================== --- agent.c (revision 1133) +++ agent.c (working copy) @@ -117,8 +117,7 @@ { /* C14-9:3 -- We're at the end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ - return ((smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) || - (mad_agent->device->process_mad && + return ((mad_agent->device->process_mad && !ib_get_smp_direction(smp) && (smp->hop_ptr == smp->hop_cnt + 1))); } @@ -283,6 +282,8 @@ { struct ib_agent_port_private *port_priv; + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; port_priv = ib_get_agent_mad(device, port_num, NULL); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n",
From mst at mellanox.co.il Thu Nov 4 07:44:01 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 17:44:01 +0200 Subject: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ In-Reply-To: <521xf9ps8b.fsf@topspin.com> References: <52zn1ypbn2.fsf@topspin.com> <20041104104120.GA2177@mellanox.co.il> <521xf9ps8b.fsf@topspin.com> Message-ID: <20041104154400.GB3499@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH 1/2] [RFC] Implement resize of CQ": > Michael> If the max. number of QPs is very big, you may want the > Michael> actual CQ size to grow gradually with demand. > > sure but there are only 2 special qps per port. Of course, it is only relevant for regular qps. mst
From rminnich at lanl.gov Thu Nov 4 07:57:30 2004 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Thu, 4 Nov 2004 08:57:30 -0700 (MST) Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: <20041104130305.GA2735@mellanox.co.il> References: <20041104130305.GA2735@mellanox.co.il> Message-ID: On Thu, 4 Nov 2004, Michael S. Tsirkin wrote: > I have uploaded an mstflint flash burning package to openib.org.
> You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ neat. How does this differ from tvflash that Roland wrote? thanks ron From halr at voltaire.com Thu Nov 4 08:04:27 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 11:04:27 -0500 Subject: [openib-general] SM and smi In-Reply-To: <52lldltdve.fsf@topspin.com> References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> Message-ID: <1099584267.2837.14.camel@hpc-1> On Mon, 2004-11-01 at 17:15, Roland Dreier wrote: > I think SMI processing should be applied to all DR SMPs passed to > ib_post_send_mad(). This requires the MAD layer to peek into the outgoing MAD but all it has is the DMA address in the sg-list and the last time I tried to do this (go from DMA address to a VA), the approach used to do this was in the process of being deprecated. Last time, I think the need to do this was obviated. Is there an acceptable alternative ? > This is what the Topspin stack does and I believe it is what OpenSM expects. I think we could change what OpenSM expected if needed so this does not appear to me to be a determining factor. -- Hal From roland at topspin.com Thu Nov 4 08:14:28 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 08:14:28 -0800 Subject: [openib-general] SM and smi In-Reply-To: <1099584267.2837.14.camel@hpc-1> (Hal Rosenstock's message of "Thu, 04 Nov 2004 11:04:27 -0500") References: <1099334442.3074.45.camel@hpc-1> <52lldltdve.fsf@topspin.com> <1099584267.2837.14.camel@hpc-1> Message-ID: <52wtx1oal7.fsf@topspin.com> Hal> This requires the MAD layer to peek into the outgoing MAD but Hal> all it has is the DMA address in the sg-list and the last Hal> time I tried to do this (go from DMA address to a VA), the Hal> approach used to do this was in the process of being Hal> deprecated. Last time, I think the need to do this was Hal> obviated. Is there an acceptable alternative ? I thought the wr.ud.mad_hdr member was added for this sort of thing... - R. From roland at topspin.com Thu Nov 4 08:15:21 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 08:15:21 -0800 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: (Ronald G. Minnich's message of "Thu, 4 Nov 2004 08:57:30 -0700 (MST)") References: <20041104130305.GA2735@mellanox.co.il> Message-ID: <52sm7poajq.fsf@topspin.com> Ronald> neat. How does this differ from tvflash that Roland wrote? correction: I just cleaned up the code. Kamen and Johannes here at Topspin did most of the real work in writing tvflash... - R From roland at topspin.com Thu Nov 4 08:15:54 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 08:15:54 -0800 Subject: [openib-general] [PATCH] Initial checkin of userspace MAD access In-Reply-To: <20041104104605.GB2177@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 4 Nov 2004 12:46:05 +0200") References: <52y8hjqwez.fsf@topspin.com> <524qk6r9k1.fsf@topspin.com> <20041103224337.GS17669@sventech.com> <20041104104605.GB2177@mellanox.co.il> Message-ID: <52oeidoait.fsf@topspin.com> Michael> /dev/ib/hca0/ports/1/mad I like this idea. Thanks, Roland From mst at mellanox.co.il Thu Nov 4 08:17:25 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 18:17:25 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: References: <20041104130305.GA2735@mellanox.co.il> Message-ID: <20041104161725.GB2550@mellanox.co.il> Hello! Quoting r. Ronald G. 
Minnich (rminnich at lanl.gov) "Re: [openib-general] announcement: mstflint flash burning package uploaded": > > > On Thu, 4 Nov 2004, Michael S. Tsirkin wrote: > > > I have uploaded an mstflint flash burning package to openib.org. > > You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > neat. How does this differ from tvflash that Roland wrote? > > thanks > > ron It supports a wider range of cards produced by Mellanox, supports integration with the IB management tools that we are developing, and is based on the flint code that we use in our production environment. MST
From mst at mellanox.co.il Thu Nov 4 08:23:33 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 18:23:33 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: <52sm7poajq.fsf@topspin.com> References: <20041104130305.GA2735@mellanox.co.il> <52sm7poajq.fsf@topspin.com> Message-ID: <20041104162333.GE2550@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] announcement: mstflint flash burning package uploaded": > Ronald> neat. How does this differ from tvflash that Roland wrote? > > correction: I just cleaned up the code. Something mstflint could benefit from, too :) MST
From mst at mellanox.co.il Thu Nov 4 08:25:47 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Nov 2004 18:25:47 +0200 Subject: [openib-general] announcement: mstflint flash burning package uploaded In-Reply-To: References: <20041104130305.GA2735@mellanox.co.il> Message-ID: <20041104162547.GF2550@mellanox.co.il> Hello! Quoting r. Ronald G. Minnich (rminnich at lanl.gov) "Re: [openib-general] announcement: mstflint flash burning package uploaded": > > > On Thu, 4 Nov 2004, Michael S. Tsirkin wrote: > > > I have uploaded an mstflint flash burning package to openib.org. > > You can find it here: https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > neat. How does this differ from tvflash that Roland wrote? Clarification: it's basically (a later revision of) the same flint utility you already have. All I did was replace the calls to the kernel driver with access to /dev/mem and friends. My code is in mtcr.h; flint.cpp is taken from production flint as is. MST
From roland at topspin.com Thu Nov 4 09:37:18 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 09:37:18 -0800 Subject: [openib-general] [PATCH] mthca/mad/agent process_mad changes (both branches) In-Reply-To: <1099543475.10754.3.camel@hpc-1> (Hal Rosenstock's message of "Wed, 03 Nov 2004 23:44:36 -0500") References: <1099518873.2837.5.camel@hpc-1> <52ekjap7yo.fsf@topspin.com> <1099543475.10754.3.camel@hpc-1> Message-ID: <527jp1o6r5.fsf@topspin.com> OK, I merged the MAD code in my branch up to r1135 and applied this patch (there was one missing chunk in ib_verbs.h to remove the snoop_mad method from struct ib_device, which I added by hand). Thanks, Roland
From roland at topspin.com Thu Nov 4 09:46:00 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 09:46:00 -0800 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk Message-ID: <52y8hhmrs7.fsf@topspin.com> I have just copied the roland-merge branch to https://openib.org/svn/gen2/trunk This tree will become the main development tree and will be used to create the tree we will submit to the kernel for inclusion. Please use this tree for testing and as the base for all patches.
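[As a usage note: a fresh working copy of the new tree can be obtained with a command along the lines of "svn checkout https://openib.org/svn/gen2/trunk", checking out into a directory of your choice.]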
I will be cleaning up this tree (mostly deleting code that does not build any more, etc) over the next few days. Thanks, Roland
From krkumar at us.ibm.com Thu Nov 4 09:44:28 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Thu, 4 Nov 2004 09:44:28 -0800 (PST) Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: Message-ID: On Thu, 4 Nov 2004, Roland Dreier wrote: > Not sure what the goal is here, but I should point out that current > mthca code does not implement resizing either CQs or QPs. Yes, I agree on that. In fact, the verbs layer will return ENOSYS for the mthca driver. But I was assuming that any other driver from a different hardware vendor could support this call (mthca could support it over time too?). > However I'm not sure I understand why the MAD layer wants to resize > these objects -- given that the number of QPs is known in advance and > that the MAD layer can choose how many work requests to post per QP, > I'm not sure what is gained by trying to resize things dynamically. Actually, I haven't really implemented the "dynamically" part, where you resize the CQ during operation. The spec says that when you create a QP, it can be larger than what you specified. If so, I see good value in increasing the size of the associated CQ, if that is supported by the driver. Thanks, - KK
From xma at us.ibm.com Thu Nov 4 10:14:49 2004 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 4 Nov 2004 10:14:49 -0800 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <52y8hhmrs7.fsf@topspin.com> Message-ID: So everybody should start working on this tree. What's the difference between openib-candidate and trunk under gen2? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
From halr at voltaire.com Thu Nov 4 10:24:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 13:24:47 -0500 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: References: Message-ID: <1099592687.3890.2.camel@perr-t30.us.voltaire.com> On Thu, 2004-11-04 at 13:14, Shirley Ma wrote: > So everybody should start working on this tree.
What's the difference > between openib-candidate and trunk under gen2? trunk is much more complete with IPoIB. This is what is heading towards being pushed to the 2.6 kernel. Yes, you should use this tree :-) -- Hal
From mashirle at us.ibm.com Thu Nov 4 12:44:03 2004 From: mashirle at us.ibm.com (Shirley Ma) Date: Thu, 4 Nov 2004 12:44:03 -0800 Subject: [openib-general] [PATCH]fix memory leak associated with agent_send_handler() in gen2/trunk Message-ID: <200411041244.03710.mashirle@us.ibm.com> Please review this patch. diff -urN infiniband/core/agent.c infiniband.patch/core/agent.c --- infiniband/core/agent.c 2004-11-04 10:35:20.000000000 -0800 +++ infiniband.patch/core/agent.c 2004-11-04 12:35:55.916027072 -0800 @@ -480,6 +480,7 @@ /* Release allocated memory */ kfree(agent_send_wr->mad); + kfree(agent_send_wr); } int ib_agent_port_open(struct ib_device *device, int port_num, -- Thanks Shirley Ma IBM Linux Technology Center
From halr at voltaire.com Thu Nov 4 13:02:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 16:02:39 -0500 Subject: [openib-general] [PATCH]fix memory leak associated with agent_send_handler() in gen2/trunk In-Reply-To: <200411041244.03710.mashirle@us.ibm.com> References: <200411041244.03710.mashirle@us.ibm.com> Message-ID: <1099602159.3110.3.camel@hpc-1> On Thu, 2004-11-04 at 15:44, Shirley Ma wrote: > Please review this patch. Thanks. Applied. -- Hal
From halr at voltaire.com Thu Nov 4 13:34:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 16:34:06 -0500 Subject: [openib-general] [PATCH] mad: Remove print of "No client for received MAD" Message-ID: <1099604046.3110.6.camel@hpc-1> mad: Removed print of "No client for received MAD" as this can be a normal case Index: mad.c =================================================================== --- mad.c (revision 1139) +++ mad.c (working copy) @@ -798,9 +798,7 @@ &mad_agent->agent, port_priv->port_num); mad_agent = NULL; } - } else - printk(KERN_NOTICE PFX "No client for received MAD on " - "port %d\n", port_priv->port_num); + } out: spin_unlock_irqrestore(&port_priv->reg_lock, flags);
From krkumar at us.ibm.com Thu Nov 4 13:31:28 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Thu, 4 Nov 2004 13:31:28 -0800 (PST) Subject: [openib-general] [PATCH] fix memory leak in ib_mad_recv_done_handler Message-ID: Also, updated a comment so that it is known that the recv_handler (when it is implemented) is in charge of freeing up recv during its processing. Applies to gen2/trunk. Thanks, - KK diff -ruNp 1/mad.c 2/mad.c --- 1/mad.c 2004-11-04 10:38:30.000000000 -0800 +++ 2/mad.c 2004-11-04 13:26:39.000000000 -0800 @@ -1045,14 +1045,16 @@ static void ib_mad_recv_done_handler(str solicited); if (mad_agent) { ib_mad_complete_recv(mad_agent, recv, solicited); - recv = NULL; /* recv is freed up via ib_mad_complete_recv */ + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv(). + */ + recv = NULL; } out: - if (recv) { - /* Should this case be optimized ?
*/ - kmem_cache_free(ib_mad_cache, recv); + ib_free_recv_mad(&recv->header.recv_wc); /* Post another receive request for this QP */ ib_mad_post_receive_mad(qp_info);
From halr at voltaire.com Thu Nov 4 14:17:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 04 Nov 2004 17:17:32 -0500 Subject: [openib-general] [PATCH] fix memory leak in ib_mad_recv_done_handler In-Reply-To: References: Message-ID: <1099606652.2834.0.camel@hpc-1> On Thu, 2004-11-04 at 16:31, Krishna Kumar wrote: > Also, updated a comment so that it is known that the recv_handler > (when it is implemented) is in charge of freeing up recv during its > processing. Applies to gen2/trunk. Thanks. Applied the commentary part of the change. The memory leak "fix" needs some work, as the MAD layer now oopses on a NULL pointer dereference at virtual address 0. I omitted this one line change (for now): - kmem_cache_free(ib_mad_cache, recv); + ib_free_recv_mad(&recv->header.recv_wc); -- Hal
From mshefty at ichips.intel.com Thu Nov 4 14:12:15 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 04 Nov 2004 14:12:15 -0800 Subject: [openib-general] Reusing receive MADs Message-ID: <418AA93F.1060602@ichips.intel.com> Is there any interest among people to reuse receive MADs? I.e. once allocated and mapped, the receive MAD and work request would be re-posted to the QP when freed. I ask because if people are interested in such an optimization at some point in the future, it will affect how I structure send queue overrun handling. - Sean
From roland at topspin.com Thu Nov 4 14:20:01 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 04 Nov 2004 14:20:01 -0800 Subject: [openib-general] Reusing receive MADs In-Reply-To: <418AA93F.1060602@ichips.intel.com> (Sean Hefty's message of "Thu, 04 Nov 2004 14:12:15 -0800") References: <418AA93F.1060602@ichips.intel.com> Message-ID: <52u0s5l0j2.fsf@topspin.com> Sean> Is there any interest among people to reuse receive MADs? Sean> I.e. once allocated and mapped, the receive MAD and work Sean> request would be re-posted to the QP when freed. I'm not sure this is that useful... MAD processing is not such a super-hot path that we need to keep per-CPU lists of cache-hot buffers (as is done for sk_buffs), and the kernel slab code should do a pretty good job of reusing buffers anyway. (The receive buffer needs to be unmapped before passing to the consumer anyway so there's not a saving there) - R.
From halr at voltaire.com Fri Nov 5 05:31:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 08:31:24 -0500 Subject: [openib-general] [PATCH] mad: Restructure smi as shared between mad and agent Message-ID: <1099661482.14234.3.camel@hpc-1> mad: Restructure smi as shared between mad and agent (adds new files smi.h, smi.c, and agent.h) Index: agent.c =================================================================== --- agent.c (revision 1161) +++ agent.c (working copy) @@ -24,6 +24,7 @@ */ #include +#include "smi.h" #include "agent_priv.h" #include "mad_priv.h" #include @@ -32,210 +33,7 @@ static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); -/* - * Fixup a directed route SMP for sending. Return 0 if the SMP should be - * discarded.
- */ -int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) -{ - u8 hop_ptr, hop_cnt; - hop_ptr = smp->hop_ptr; - hop_cnt = smp->hop_cnt; - - /* See section 14.2.2.2, Vol 1 IB spec */ - if (!ib_get_smp_direction(smp)) { - /* C14-9:1 */ - if (hop_cnt && hop_ptr == 0) { - smp->hop_ptr++; - return (smp->initial_path[smp->hop_ptr] == - port_num); - } - - /* C14-9:2 */ - if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - /* smp->return_path set when received */ - smp->hop_ptr++; - return (smp->initial_path[smp->hop_ptr] == - port_num); - } - - /* C14-9:3 -- We're at the end of the DR segment of path */ - if (hop_ptr == hop_cnt) { - /* smp->return_path set when received */ - smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); - } - - /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ - /* C14-9:5 -- Fail unreasonable hop pointer. */ - return (hop_ptr == hop_cnt + 1); - - } else { - /* C14-13:1 */ - if (hop_cnt && hop_ptr == hop_cnt + 1) { - smp->hop_ptr--; - return (smp->return_path[smp->hop_ptr] == - port_num); - } - - /* C14-13:2 */ - if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - smp->hop_ptr--; - return (smp->return_path[smp->hop_ptr] == - port_num); - } - - /* C14-13:3 -- at the end of the DR segment of path */ - if (hop_ptr == 1) { - smp->hop_ptr--; - /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || - smp->dr_slid == IB_LID_PERMISSIVE); - } - - /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ - /* C14-13:5 -- Check for unreasonable hop pointer. */ - return 0; - } -} - -/* - * Return 1 if the SMP should be handled by the local SMA via process_mad. - */ -static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp) -{ - /* C14-9:3 -- We're at the end of the DR segment of path */ - /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ - return ((mad_agent->device->process_mad && - !ib_get_smp_direction(smp) && - (smp->hop_ptr == smp->hop_cnt + 1))); -} - -/* - * Adjust information for a received SMP. Return 0 if the SMP should be - * dropped. - */ -int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) -{ - u8 hop_ptr, hop_cnt; - - hop_ptr = smp->hop_ptr; - hop_cnt = smp->hop_cnt; - - /* See section 14.2.2.2, Vol 1 IB spec */ - if (!ib_get_smp_direction(smp)) { - /* C14-9:1 -- sender should have incremented hop_ptr */ - if (hop_cnt && hop_ptr == 0) - return 0; - - /* C14-9:2 -- intermediate hop */ - if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - smp->return_path[hop_ptr] = port_num; - /* smp->hop_ptr updated when sending */ - return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); - } - - /* C14-9:3 -- We're at the end of the DR segment of path */ - if (hop_ptr == hop_cnt) { - if (hop_cnt) - smp->return_path[hop_ptr] = port_num; - /* smp->hop_ptr updated when sending */ - - return (node_type == IB_NODE_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); - } - - /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ - /* C14-9:5 -- fail unreasonable hop pointer. 
*/ - return (hop_ptr == hop_cnt + 1); - - } else { - - /* C14-13:1 */ - if (hop_cnt && hop_ptr == hop_cnt + 1) { - smp->hop_ptr--; - return (smp->return_path[smp->hop_ptr] == - port_num); - } - - /* C14-13:2 */ - if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) - return 0; - - /* smp->hop_ptr updated when sending */ - return (smp->return_path[hop_ptr-1] <= phys_port_cnt); - } - - /* C14-13:3 -- We're at the end of the DR segment of path */ - if (hop_ptr == 1) { - if (smp->dr_slid == IB_LID_PERMISSIVE) { - /* giving SMP to SM - update hop_ptr */ - smp->hop_ptr--; - return 1; - } - /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH); - } - - /* C14-13:4 -- hop_ptr = 0 -> give to SM. */ - /* C14-13:5 -- Check for unreasonable hop pointer. */ - return (hop_ptr == 0); - } -} - -/* - * Return 1 if the received DR SMP should be forwarded to the send queue. - * Return 0 if the SMP should be completed up the stack. - */ -int smi_check_forward_dr_smp(struct ib_smp *smp) -{ - u8 hop_ptr, hop_cnt; - - hop_ptr = smp->hop_ptr; - hop_cnt = smp->hop_cnt; - - if (!ib_get_smp_direction(smp)) { - /* C14-9:2 -- intermediate hop */ - if (hop_ptr && hop_ptr < hop_cnt) - return 1; - - /* C14-9:3 -- at the end of the DR segment of path */ - if (hop_ptr == hop_cnt) - return (smp->dr_dlid == IB_LID_PERMISSIVE); - - /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ - if (hop_ptr == hop_cnt + 1) - return 1; - } else { - /* C14-13:2 */ - if (2 <= hop_ptr && hop_ptr <= hop_cnt) - return 1; - - /* C14-13:3 -- at the end of the DR segment of path */ - if (hop_ptr == 1) - return (smp->dr_slid != IB_LID_PERMISSIVE); - } - return 0; -} - static inline struct ib_agent_port_private * __ib_get_agent_mad(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) Index: mad.c =================================================================== --- mad.c (revision 1161) +++ mad.c (working copy) @@ -56,6 +56,8 @@ #include #include "mad_priv.h" +#include "smi.h" +#include "agent.h" #include #include @@ -922,23 +924,6 @@ } } -extern int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt); -extern int smi_check_forward_dr_smp(struct ib_smp *smp); -extern int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num); -extern int smi_check_local_dr_smp(struct ib_smp *smp, - struct ib_device *device, - int port_num); -extern int agent_send(struct ib_mad *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num); - static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -1877,12 +1862,6 @@ return 0; } - -extern int ib_agent_port_open(struct ib_device *device, int port_num, - int phys_port_cnt); -extern int ib_agent_port_close(struct ib_device *device, int port_num); - - static void ib_mad_init_device(struct ib_device *device) { int ret, num_ports, cur_port, i, ret2; Index: agent.h =================================================================== --- agent.h (revision 0) +++ agent.h (revision 0) @@ -0,0 +1,41 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . 
+ + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern int ib_agent_port_open(struct ib_device *device, + int port_num, + int phys_port_cnt); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ Index: smi.c =================================================================== --- smi.c (revision 0) +++ smi.c (revision 0) @@ -0,0 +1,219 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + + +/* + * Fixup a directed route SMP for sending. Return 0 if the SMP should be + * discarded. + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + /* C14-9:5 -- Fail unreasonable hop pointer. 
*/ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ + /* C14-13:5 -- Check for unreasonable hop pointer. */ + return 0; + } +} + +/* + * Adjust information for a received SMP. Return 0 if the SMP should be + * dropped. + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + /* C14-9:5 -- fail unreasonable hop pointer. */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM. */ + /* C14-13:5 -- Check for unreasonable hop pointer. */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue. + * Return 0 if the SMP should be completed up the stack. + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. 
*/ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + Index: Makefile =================================================================== --- Makefile (revision 1161) +++ Makefile (working copy) @@ -17,6 +17,7 @@ ib_mad-objs := \ mad.o \ + smi.o \ agent.o ib_sa-objs := sa_query.o Index: smi.h =================================================================== --- smi.h (revision 0) +++ smi.h (revision 0) @@ -0,0 +1,54 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA via process_mad. + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From halr at voltaire.com Fri Nov 5 10:49:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 13:49:39 -0500 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad Message-ID: <1099680579.2965.7.camel@hpc-1> mad: Handle outgoing SMPs in ib_post_send_mad The MAD layer is now ready to support the SM :-) I have not yet handled the additional special cases: hop count increment done by device, use send queue rather than process MAD for 0 hop SMPs). 
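For reference, every C14-9/C14-13 case split in the SMI code above keys
off of the SMP direction bit. That test is just the D bit, the top bit of
the big-endian status word of a directed route SMP (IB spec 14.2.2). A
minimal sketch of the accessor the code assumes (sketch only; the
authoritative definition lives in the ib_smi.h header, not here):

/* D bit of a directed route SMP: 0 = outgoing (initial path),
 * 1 = returning.  Sketch of the accessor assumed by smi.c above.
 */
static inline int ib_get_smp_direction(struct ib_smp *smp)
{
	return ((smp->status & cpu_to_be16(0x8000)) == cpu_to_be16(0x8000));
}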
Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1161) +++ mad_priv.h (working copy) @@ -115,6 +115,7 @@ atomic_t refcount; wait_queue_head_t wait; + int phys_port_cnt; u8 rmpp_version; }; Index: mad.c =================================================================== --- mad.c (revision 1162) +++ mad.c (working copy) @@ -89,6 +89,7 @@ static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); +static int solicited_mad(struct ib_mad *mad); /* * Returns a ib_mad_port_private structure or NULL for a device/port. @@ -243,6 +244,7 @@ mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->phys_port_cnt = port_priv->phys_port_cnt; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -368,6 +370,105 @@ spin_unlock_irqrestore(&mad_queue->lock, flags); } +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret; + + if (!smi_handle_dr_smp_send(smp, + mad_agent->device->node_type, + mad_agent->port_num)) { + ret = -EINVAL; + printk(KERN_ERR "Invalid directed route\n"); + goto error1; + } + if (smi_check_local_dr_smp(smp, + mad_agent->device, + mad_agent->port_num)) { + struct ib_mad_private *mad_priv; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wc mad_send_wc; + + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto error1; + } + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + ret = mad_agent->device->process_mad(mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? */ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (!smi_handle_dr_smp_recv((struct ib_smp *)&mad_priv->mad, + mad_agent->device->node_type, + mad_agent->port_num, + mad_agent_priv->phys_port_cnt)) { + ret = -EINVAL; + kmem_cache_free(ib_mad_cache, mad_priv); + goto error1; + } + } + + /* See if response is solicited and there is a recv handler */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; + + /* Defined behavior is to complete response before request */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler(mad_agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + + if (mad_agent_priv->agent.send_handler) { + /* Now, complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent_priv->agent.send_handler(mad_agent, &mad_send_wc); + ret = 1; + } else + ret = -EINVAL; + } else + ret = 0; + +error1: + return ret; +} + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr, struct ib_send_wr *send_wr, @@ -422,9 +523,27 @@ while (cur_send_wr) { unsigned long flags; struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + if (!cur_send_wr->wr.ud.mad_hdr) { + *bad_send_wr = cur_send_wr; + printk(KERN_ERR PFX "MAD header must be supplied in WR %p\n", cur_send_wr); + goto error1; + } + next_send_wr = (struct ib_send_wr *)cur_send_wr->next; + smp = (struct ib_smp *)cur_send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent, smp, cur_send_wr); + if (ret < 0) { /* error */ + *bad_send_wr = cur_send_wr; + goto error1; + } else if (ret == 1) { /* locally consumed */ + goto next; + } + } + /* Allocate MAD send WR tracking structure */ mad_send_wr = kmalloc(sizeof *mad_send_wr, (in_atomic() || irqs_disabled()) ? @@ -467,7 +586,8 @@ atomic_dec(&mad_agent_priv->refcount); return ret; } - cur_send_wr= next_send_wr; +next: + cur_send_wr = next_send_wr; } return 0; From halr at voltaire.com Fri Nov 5 10:54:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 13:54:47 -0500 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: References: Message-ID: <1099680887.2965.18.camel@hpc-1> On Thu, 2004-11-04 at 12:44, Krishna Kumar wrote: > On Thu, 4 Nov 2004, Roland Dreier wrote: > > > Not sure what the goal is here, but I should point out that current > > mthca code does not implement resizing either CQs or QPs. > > Yes, I agree on that. Infact the verbs layer will return ENOSYS for > mthca driver. But I was assuming that any other driver by a different > hardware vendor can support this call (mthca over time could support > this call too ?). Is this a driver or firmware issue ? > > However I'm not sure I understand why the MAD layer wants to resize > > these objects -- given that the number of QPs is known in advance and > > that the MAD layer can choose how many work requests to post per QP, > > I'm not sure what is gained by trying to resize things dynamically. > > Actually, I haven't really implemented the "dynamically" part, where you > resize the CQ during operation. The spec said that when you create a QP, > it can be larger than what you specified. If so, I see good value in > increasing the size of the associated CQ, if it is supported by the > driver. Might this be useful for redirected QPs ? Should the incorporation of this functionality be deferred until either there is hardware which supports this or we find some use for it ? 
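(For concreteness, the kind of use being weighed here, as an untested
sketch: the resize verb's name and signature are assumed from the RFC
patches, and the throttling fallback is only a comment, not code from the
tree.)

/* Grow the CQ to cover the actual (possibly rounded-up) QP depths.
 * If the device cannot resize -- mthca returns -ENOSYS today -- the
 * caller must instead limit posted WRs to the original CQ size.
 */
static int grow_cq_for_qp(struct ib_cq *cq, int actual_send_wr,
			  int actual_recv_wr)
{
	int ret = ib_resize_cq(cq, actual_send_wr + actual_recv_wr);

	if (ret == -ENOSYS)
		ret = 0;	/* fall back to throttling posted WRs */
	return ret;
}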
-- Hal From halr at voltaire.com Fri Nov 5 10:56:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 13:56:52 -0500 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <52y8hhmrs7.fsf@topspin.com> References: <52y8hhmrs7.fsf@topspin.com> Message-ID: <1099681012.2965.22.camel@hpc-1> On Thu, 2004-11-04 at 12:46, Roland Dreier wrote: > I have just copied the roland-merge branch to > > https://openib.org/svn/gen2/trunk > > This tree will become the main development tree and will be used to > create the tree we will submit to the kernel for inclusion. Please > use this tree for testing and as the base for all patches. > > I will be cleaning up this tree (mostly deleting code that does not > build any more, etc) over the next few days. This looks great. I have just 2 minor questions: 1. Are there changes planned for core/cache.c ? 2. Shouldn't src/userspace/tools/libsdp be removed for now ? Thanks. -- Hal From roland at topspin.com Fri Nov 5 11:20:21 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 11:20:21 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <1099680887.2965.18.camel@hpc-1> (Hal Rosenstock's message of "Fri, 05 Nov 2004 13:54:47 -0500") References: <1099680887.2965.18.camel@hpc-1> Message-ID: <52is8khzm2.fsf@topspin.com> Hal> Is this a driver or firmware issue ? Driver issue. I just haven't implemented CQ resize yet, and it's not a high priority for me. Hal> Might this be useful for redirected QPs ? I don't think so, since the redirected QP will not be attached to the MAD layer's CQ. Hal> Should the incorporation of this functionality be deferred Hal> until either there is hardware which supports this or we find Hal> some use for it ? I think so. - R. From roland at topspin.com Fri Nov 5 11:21:12 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 11:21:12 -0800 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <1099681012.2965.22.camel@hpc-1> (Hal Rosenstock's message of "Fri, 05 Nov 2004 13:56:52 -0500") References: <52y8hhmrs7.fsf@topspin.com> <1099681012.2965.22.camel@hpc-1> Message-ID: <52ekj8hzkn.fsf@topspin.com> Hal> 1. Are there changes planned for core/cache.c ? I've cleaned it up a little but I'm really not sure exactly what should be done with it. Hal> 2. Shouldn't src/userspace/tools/libsdp be removed for now ? Yeah, I'll do that. -R. From mshefty at ichips.intel.com Fri Nov 5 11:21:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 05 Nov 2004 11:21:17 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <1099680887.2965.18.camel@hpc-1> References: <1099680887.2965.18.camel@hpc-1> Message-ID: <418BD2AD.6030807@ichips.intel.com> Hal Rosenstock wrote: >>>However I'm not sure I understand why the MAD layer wants to resize >>>these objects -- given that the number of QPs is known in advance and >>>that the MAD layer can choose how many work requests to post per QP, >>>I'm not sure what is gained by trying to resize things dynamically. >> >>Actually, I haven't really implemented the "dynamically" part, where you >>resize the CQ during operation. The spec said that when you create a QP, >>it can be larger than what you specified. If so, I see good value in >>increasing the size of the associated CQ, if it is supported by the >>driver. > > > Might this be useful for redirected QPs ? 
Since the client allocates the QP and CQ in this case, they would be
responsible for resizing the CQ appropriately. The MAD layer could
provide queuing to prevent send queue overflow, or not, depending on how
we want to implement it.

> Should the incorporation of this functionality be deferred until either
> there is hardware which supports this or we find some use for it ?

I think we should go ahead and put this code in. We need to handle the
case where the QP is sized larger than what we request anyway, to ensure
that we don't overrun the CQ.

From krkumar at us.ibm.com  Fri Nov  5 11:57:53 2004
From: krkumar at us.ibm.com (Krishna Kumar)
Date: Fri, 5 Nov 2004 11:57:53 -0800 (PST)
Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query.
Message-ID: 

Current code frees up memory in error case and dereferences it later,
plus the success case doesn't (seem to) free it up.

(do you guys need patches to be rooted from a particular directory
to be more efficient/convenient ?)

- KK

diff -ruNp 7/sa_query.c 8/sa_query.c
--- 7/sa_query.c	2004-11-05 11:37:44.000000000 -0800
+++ 8/sa_query.c	2004-11-05 11:51:06.000000000 -0800
@@ -544,12 +544,14 @@ int ib_sa_path_rec_get(struct ib_device
 		 rec, query->sa_query.mad->data);
 
 	ret = send_mad(&query->sa_query, timeout_ms);
-	if (ret)
-		kfree(query);
-
-	*sa_query = &query->sa_query;
-
-	return ret ? ret : query->sa_query.id;
+	if (!ret) {
+		/* Success, return the SA Query and ID. */
+		ret = query->sa_query.id;
+		*sa_query = &query->sa_query;
+	}
+	kfree(query);
+	return ret;
 }
 EXPORT_SYMBOL(ib_sa_path_rec_get);
 
@@ -617,12 +619,14 @@ int ib_sa_mcmember_rec_query(struct ib_d
 		 rec, query->sa_query.mad->data);
 
 	ret = send_mad(&query->sa_query, timeout_ms);
-	if (ret)
-		kfree(query);
-
-	*sa_query = &query->sa_query;
+	if (!ret) {
+		/* Success, return the SA Query and ID. */
+		ret = query->sa_query.id;
+		*sa_query = &query->sa_query;
+	}
+	kfree(query);
+	return ret;
 
-	return ret ? ret : query->sa_query.id;
 }
 EXPORT_SYMBOL(ib_sa_mcmember_rec_query);

From halr at voltaire.com  Fri Nov  5 12:09:42 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 05 Nov 2004 15:09:42 -0500
Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query.
In-Reply-To: 
References: 
Message-ID: <1099685382.3278.56.camel@localhost.localdomain>

On Fri, 2004-11-05 at 14:57, Krishna Kumar wrote:
> (do you guys need patches to be rooted from a particular directory
> to be more efficient/convenient ?)

We should be patching against gen2/trunk now.

-- Hal

From halr at voltaire.com  Fri Nov  5 12:17:11 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 05 Nov 2004 15:17:11 -0500
Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: <418BD2AD.6030807@ichips.intel.com>
References: <1099680887.2965.18.camel@hpc-1> <418BD2AD.6030807@ichips.intel.com>
Message-ID: <1099685830.3278.65.camel@localhost.localdomain>

On Fri, 2004-11-05 at 14:21, Sean Hefty wrote:
> I think we should go ahead and put this code in. We need to handle the
> case where the QP is sized larger than what we request anyway, to ensure
> that we don't overrun the CQ.

Does the driver do this (QP is sized larger than what was requested) now
? Or is this a spec thing ?

If so, just to make sure I have this straight, what is/are the specific
patch(es) ? Is it the 2 patches from Wednesday entitled "PATCH 1/2
Resize CQ" and "PATCH 2/2 Implement error handling in resize failure".

Thanks.
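(To illustrate what "sized larger than requested" would look like to a
consumer -- a sketch, assuming the create verb reports the actual,
possibly rounded-up, capacities back through the cap fields as the spec
allows; check the verbs layer before relying on this:)

struct ib_qp_init_attr init_attr;	/* other fields (CQs, QP type,
					 * etc.) omitted from the sketch */
struct ib_qp *qp;

init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
qp = ib_create_qp(pd, &init_attr);
if (!IS_ERR(qp) &&
    (init_attr.cap.max_send_wr > IB_MAD_QP_SEND_SIZE ||
     init_attr.cap.max_recv_wr > IB_MAD_QP_RECV_SIZE)) {
	/* The QP came back deeper than requested: either resize the
	 * CQ to match or throttle posted WRs to the requested depth. */
}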
-- Hal From roland at topspin.com Fri Nov 5 12:22:09 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 12:22:09 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <1099685830.3278.65.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 05 Nov 2004 15:17:11 -0500") References: <1099680887.2965.18.camel@hpc-1> <418BD2AD.6030807@ichips.intel.com> <1099685830.3278.65.camel@localhost.localdomain> Message-ID: <52654khwr2.fsf@topspin.com> Hal> Does the driver do this (QP is sized larger than what was Hal> requested) now ? Or is this a spec thing ? Unless my memory is playing tricks on me, I don't think mthca will create a QP larger than requested. - R. From ftillier at infiniconsys.com Fri Nov 5 12:22:13 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 5 Nov 2004 12:22:13 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <418BD2AD.6030807@ichips.intel.com> Message-ID: <000001c4c375$26bbac40$655aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, November 05, 2004 11:21 AM > > Hal Rosenstock wrote: > >>>However I'm not sure I understand why the MAD layer wants to resize > >>>these objects -- given that the number of QPs is known in advance and > >>>that the MAD layer can choose how many work requests to post per QP, > >>>I'm not sure what is gained by trying to resize things dynamically. > >> > >>Actually, I haven't really implemented the "dynamically" part, where you > >>resize the CQ during operation. The spec said that when you create a QP, > >>it can be larger than what you specified. If so, I see good value in > >>increasing the size of the associated CQ, if it is supported by the > >>driver. > > > > > > Might this be useful for redirected QPs ? > > Since the client allocates the QP and CQ in this case, they would be > responsible for resizing the CQ appropriately. The MAD layer could > provide queuing to prevent send queue overflow, or not, depending on how > we want to implement it. If the MAD layer did provide queuing to prevent overflow for the requested (not allocated) depth, then the CQ resize is unnecessary. I would expect that whatever code manages the QP/CQ should provide queuing so that MAD agents don't all have to implement queueing with respect to one another. - Fab From ftillier at infiniconsys.com Fri Nov 5 12:22:13 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 5 Nov 2004 12:22:13 -0800 Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ In-Reply-To: <52is8khzm2.fsf@topspin.com> Message-ID: <000101c4c375$327c66a0$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, November 05, 2004 11:20 AM > > Hal> Is this a driver or firmware issue ? > > Driver issue. I just haven't implemented CQ resize yet, and it's not > a high priority for me. > As far as I know, Tavor does not support QP resize. - Fab From halr at voltaire.com Fri Nov 5 12:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 05 Nov 2004 15:22:35 -0500 Subject: [openib-general] [ANNOUNCE] Opening of gen2 trunk In-Reply-To: <52ekj8hzkn.fsf@topspin.com> References: <52y8hhmrs7.fsf@topspin.com> <1099681012.2965.22.camel@hpc-1> <52ekj8hzkn.fsf@topspin.com> Message-ID: <1099686155.3278.71.camel@localhost.localdomain> On Fri, 2004-11-05 at 14:21, Roland Dreier wrote: > Hal> 1. Are there changes planned for core/cache.c ? 
>
> I've cleaned it up a little but I'm really not sure exactly what
> should be done with it.

One more thing would be to rename ts_ib_core.h to something like
ib_cache.h.

-- Hal

From mshefty at ichips.intel.com  Fri Nov  5 12:27:00 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 05 Nov 2004 12:27:00 -0800
Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: <000001c4c375$26bbac40$655aa8c0@infiniconsys.com>
References: <000001c4c375$26bbac40$655aa8c0@infiniconsys.com>
Message-ID: <418BE214.3070302@ichips.intel.com>

Fab Tillier wrote:
> If the MAD layer did provide queuing to prevent overflow for the requested
> (not allocated) depth, then the CQ resize is unnecessary. I would expect
> that whatever code manages the QP/CQ should provide queuing so that MAD
> agents don't all have to implement queueing with respect to one another.

Resizing the CQ is an optimization only. If the resize fails, the MAD
layer will simply restrict the number of outstanding sends/receives.
The MAD layer will queue sends, but not receives.

- Sean

From krkumar at us.ibm.com  Fri Nov  5 12:56:15 2004
From: krkumar at us.ibm.com (Krishna Kumar)
Date: Fri, 5 Nov 2004 12:56:15 -0800 (PST)
Subject: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: <1099685830.3278.65.camel@localhost.localdomain>
Message-ID: 

Hi Hal,

I think others have answered it, but my 2 cents :

1. The current driver doesn't do this; it is a spec thing, and a potential
driver could do this in the future. It is useful in the sense that the CQ
is always at least the size of the QPs that are using it, in case the
driver implements it, as Sean mentioned.

2. The patches that you are referring to are correct. If people agree
that it is useful to have, I can regenerate against the latest bits
(that is, if it doesn't apply).

thanks,

- KK

On Fri, 5 Nov 2004, Hal Rosenstock wrote:

> On Fri, 2004-11-05 at 14:21, Sean Hefty wrote:
> > I think we should go ahead and put this code in. We need to handle the
> > case where the QP is sized larger than what we request anyway, to ensure
> > that we don't overrun the CQ.
>
> Does the driver do this (QP is sized larger than what was requested) now
> ? Or is this a spec thing ?
>
> If so, just to make sure I have this straight, what is/are the specific
> patch(es) ? Is it the 2 patches from Wednesday entitled "PATCH 1/2
> Resize CQ" and "PATCH 2/2 Implement error handling in resize failure".
>
> Thanks.
>
> -- Hal
>
>

From roland at topspin.com  Fri Nov  5 13:30:13 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 05 Nov 2004 13:30:13 -0800
Subject: Re: [openib-general] [PATCH 1/2][RFC] Implement resize of CQ
In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 12:56:15 -0800 (PST)")
References: 
Message-ID: <521xf8htlm.fsf@topspin.com>

I guess my bottom line is that these patches add complexity and can't
be tested at the moment, so my inclination would be to leave them out.

- R.

From krkumar at us.ibm.com  Fri Nov  5 13:22:46 2004
From: krkumar at us.ibm.com (Krishna Kumar)
Date: Fri, 5 Nov 2004 13:22:46 -0800 (PST)
Subject: [openib-general] [PATCH] Extra kfrees, clean up unregisters, etc ...
Message-ID: 

1. Don't kfree sa_dev twice.

2. Unnecessary kref_put : in the failure case, we don't seem to have a
reference until update_sm_ah is called in the success case.

3. Clean up code which looks like a hack (i++ in failure).

4. Too many "i - s" computations; no need to keep recalculating this :-)

5.
Potential extra cleanup : I could have set index = 0 instead of index = i - s, I kept it this way to be quite identical to existing code. Patch applies with -p1 on trunk directory, on top of my previous patch. Thanks, - KK diff -ruNp trunk/src/linux-kernel/infiniband/core/sa_query.c.org trunk/src/linux-kernel/infiniband/core/sa_query.c --- trunk/src/linux-kernel/infiniband/core/sa_query.c.org 2004-11-05 11:51:06.000000000 -0800 +++ trunk/src/linux-kernel/infiniband/core/sa_query.c 2004-11-05 13:10:50.000000000 -0800 @@ -682,6 +682,7 @@ static void ib_sa_add_one(struct ib_devi { struct ib_sa_device *sa_dev; int s, e, i; + int index; if (device->node_type == IB_NODE_SWITCH) s = e = 0; @@ -703,29 +704,29 @@ static void ib_sa_add_one(struct ib_devi sa_dev->start_port = s; sa_dev->end_port = e; - for (i = s; i <= e; ++i) { - sa_dev->port[i - s].mr = NULL; - sa_dev->port[i - s].sm_ah = NULL; - sa_dev->port[i - s].port_num = i; - spin_lock_init(&sa_dev->port[i - s].ah_lock); + for (i = s, index = i - s; i <= e; ++i, ++index) { + sa_dev->port[index].mr = NULL; + sa_dev->port[index].sm_ah = NULL; + sa_dev->port[index].port_num = i; + spin_lock_init(&sa_dev->port[index].ah_lock); - sa_dev->port[i - s].agent = + sa_dev->port[index].agent = ib_register_mad_agent(device, i, IB_QPT_GSI, NULL, 0, send_handler, recv_handler, sa_dev); - if (IS_ERR(sa_dev->port[i - s].agent)) + if (IS_ERR(sa_dev->port[index].agent)) goto err; - sa_dev->port[i - s].mr = ib_get_dma_mr(sa_dev->port[i - s].agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(sa_dev->port[i - s].mr)) { - /* Bump i so agent from this iter. is freed */ - ++i; + sa_dev->port[index].mr = + ib_get_dma_mr(sa_dev->port[index].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[index].mr)) { + ib_unregister_mad_agent(sa_dev->port[index].agent); goto err; } - INIT_WORK(&sa_dev->port[i - s].update_task, - update_sm_ah, &sa_dev->port[i - s]); + INIT_WORK(&sa_dev->port[index].update_task, + update_sm_ah, &sa_dev->port[index]); } /* @@ -736,27 +737,20 @@ static void ib_sa_add_one(struct ib_devi */ INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); - if (ib_register_event_handler(&sa_dev->event_handler)) { - kfree(sa_dev); + if (ib_register_event_handler(&sa_dev->event_handler)) goto err; - } - for (i = s; i <= e; ++i) - update_sm_ah(&sa_dev->port[i - s]); + while (--index >= 0) + update_sm_ah(&sa_dev->port[index]); ib_set_client_data(device, &sa_client, sa_dev); return; err: - while (--i >= s) { - if (sa_dev->port[i - s].mr && !IS_ERR(sa_dev->port[i - s].mr)) - ib_dereg_mr(sa_dev->port[i - s].mr); - - if (sa_dev->port[i - s].sm_ah) - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); - - ib_unregister_mad_agent(sa_dev->port[i - s].agent); + while (--index >= 0) { + ib_dereg_mr(sa_dev->port[index].mr); + ib_unregister_mad_agent(sa_dev->port[index].agent); } kfree(sa_dev); From krkumar at us.ibm.com Fri Nov 5 13:48:56 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Fri, 5 Nov 2004 13:48:56 -0800 (PST) Subject: [openib-general] [PATCH] mad doesn't get freed up after send_mad is called Message-ID: Applies on top of my previous patch... 
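(Background for the leak: the query setup path makes two separate
allocations, so both must be freed on every exit path. A simplified
sketch of the pattern, not the literal sa_query.c code:)

/* simplified from the ib_sa_path_rec_get() setup path */
struct ib_sa_path_query *query;

query = kmalloc(sizeof *query, GFP_KERNEL);
if (!query)
	return -ENOMEM;

query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, GFP_KERNEL);
if (!query->sa_query.mad) {
	kfree(query);		/* don't leak the wrapper either */
	return -ENOMEM;
}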
diff -ruNp trunk/src/linux-kernel/infiniband/core/sa_query.c.org trunk/src/linux-kernel/infiniband/core/sa_query.c --- trunk/src/linux-kernel/infiniband/core/sa_query.c.org 2004-11-05 13:13:12.000000000 -0800 +++ trunk/src/linux-kernel/infiniband/core/sa_query.c 2004-11-05 13:43:10.000000000 -0800 @@ -550,6 +550,7 @@ int ib_sa_path_rec_get(struct ib_device ret = query->sa_query.id; *sa_query = &query->sa_query; } + kfree(query->sa_query.mad); kfree(query); return ret; } @@ -624,6 +625,7 @@ int ib_sa_mcmember_rec_query(struct ib_d ret = query->sa_query.id; *sa_query = &query->sa_query; } + kfree(query->sa_query.mad); kfree(query); return ret; From krkumar at us.ibm.com Fri Nov 5 14:01:22 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Fri, 5 Nov 2004 14:01:22 -0800 (PST) Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c Message-ID: diff -ruNp trunk/src/linux-kernel/infiniband/core/sa_query.c.org trunk/src/linux-kernel/infiniband/core/sa_query.c --- trunk/src/linux-kernel/infiniband/core/sa_query.c.org 2004-11-05 13:43:10.000000000 -0800 +++ trunk/src/linux-kernel/infiniband/core/sa_query.c 2004-11-05 13:58:47.000000000 -0800 @@ -387,18 +387,21 @@ static void ib_sa_event(struct ib_event_ } } -void ib_sa_cancel_query(int id, struct ib_sa_query *query) +static inline struct ib_sa_query *ib_sa_find_idr(int id) { - unsigned long flags; + struct ib_sa_query *query + unsigned long flags; spin_lock_irqsave(&idr_lock, flags); - if (idr_find(&query_idr, query->id) != query) { - spin_unlock_irqrestore(&idr_lock, flags); - return; - } + query = idr_find(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); + return query; +} - ib_cancel_mad(query->port->agent, query->id); +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + if (ib_sa_find_idr(id) == query) + ib_cancel_mad(query->port->agent, query->id); } EXPORT_SYMBOL(ib_sa_cancel_query); @@ -638,10 +641,7 @@ static void send_handler(struct ib_mad_a struct ib_sa_query *query; unsigned long flags; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_send_wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); - + query = ib_sa_find_idr(mad_send_wc->wr_id); if (!query) return; @@ -661,12 +661,8 @@ static void recv_handler(struct ib_mad_a struct ib_mad_recv_wc *mad_recv_wc) { struct ib_sa_query *query; - unsigned long flags; - - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); + query = ib_sa_find_idr(mad_recv_wc->wc->wr_id); if (query) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, From mshefty at ichips.intel.com Fri Nov 5 15:14:09 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 05 Nov 2004 15:14:09 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <1099680579.2965.7.camel@hpc-1> References: <1099680579.2965.7.camel@hpc-1> Message-ID: <418C0941.2080904@ichips.intel.com> Hal Rosenstock wrote: > mad: Handle outgoing SMPs in ib_post_send_mad > The MAD layer is now ready to support the SM :-) > > I have not yet handled the additional special cases: hop count increment > done by device, use send queue rather than process MAD for 0 hop SMPs). Hal, can you check that your code stays within 80 characters per line? 
- Sean From roland at topspin.com Fri Nov 5 18:44:36 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 18:44:36 -0800 Subject: [openib-general] [PATCH] Extra kfrees, clean up unregisters, etc ... In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 13:22:46 -0800 (PST)") References: Message-ID: <52fz3nhf1n.fsf@topspin.com> Thanks for the audit. I applied this version of your patch. Does this still look correct? Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1164) +++ infiniband/core/sa_query.c (working copy) @@ -699,29 +710,28 @@ sa_dev->start_port = s; sa_dev->end_port = e; - for (i = s; i <= e; ++i) { - sa_dev->port[i - s].mr = NULL; - sa_dev->port[i - s].sm_ah = NULL; - sa_dev->port[i - s].port_num = i; - spin_lock_init(&sa_dev->port[i - s].ah_lock); + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); - sa_dev->port[i - s].agent = - ib_register_mad_agent(device, i, IB_QPT_GSI, + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, NULL, 0, send_handler, recv_handler, sa_dev); - if (IS_ERR(sa_dev->port[i - s].agent)) + if (IS_ERR(sa_dev->port[i].agent)) goto err; - sa_dev->port[i - s].mr = ib_get_dma_mr(sa_dev->port[i - s].agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(sa_dev->port[i - s].mr)) { - /* Bump i so agent from this iter. is freed */ - ++i; + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); goto err; } - INIT_WORK(&sa_dev->port[i - s].update_task, - update_sm_ah, &sa_dev->port[i - s]); + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); } /* @@ -732,27 +742,20 @@ */ INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); - if (ib_register_event_handler(&sa_dev->event_handler)) { - kfree(sa_dev); + if (ib_register_event_handler(&sa_dev->event_handler)) goto err; - } - for (i = s; i <= e; ++i) - update_sm_ah(&sa_dev->port[i - s]); + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); ib_set_client_data(device, &sa_client, sa_dev); return; err: - while (--i >= s) { - if (sa_dev->port[i - s].mr && !IS_ERR(sa_dev->port[i - s].mr)) - ib_dereg_mr(sa_dev->port[i - s].mr); - - if (sa_dev->port[i - s].sm_ah) - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); - - ib_unregister_mad_agent(sa_dev->port[i - s].agent); + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); } kfree(sa_dev); From roland at topspin.com Fri Nov 5 19:09:18 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 19:09:18 -0800 Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 11:57:53 -0800 (PST)") References: Message-ID: <52brebhdwh.fsf@topspin.com> Sorry, this and the follow-up patch are wrong. The if the send succeeds then we can't free the query structure until the query finishes up. (The query will be freed in the appropriate ->release method in this case). You are right that there is a memory leak though. 
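(To spell out the constraint: once send_mad() succeeds, the MAD layer
still owns the buffer until the completion or timeout path calls back
into sa_query.c, so freeing in the ib_sa_*_get() caller would be a
use-after-free. A rough sketch of the lifetime, not literal code:)

ret = send_mad(&query->sa_query, timeout_ms);
if (!ret) {
	/* Ownership has passed to the MAD layer.  Later, from the
	 * send completion (or timeout), the handler runs roughly:
	 *
	 *	query->callback(query, ...);
	 *	query->release(query);	/- the only place to kfree -/
	 */
}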
I fixed it like this: Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1166) +++ infiniband/core/sa_query.c (working copy) @@ -500,6 +500,7 @@ static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) { + kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } @@ -544,11 +545,12 @@ rec, query->sa_query.mad->data); ret = send_mad(&query->sa_query, timeout_ms); - if (ret) + if (ret) { + kfree(query->sa_query.mad); kfree(query); + } else + *sa_query = &query->sa_query; - *sa_query = &query->sa_query; - return ret ? ret : query->sa_query.id; } EXPORT_SYMBOL(ib_sa_path_rec_get); @@ -572,6 +574,7 @@ static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) { + kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } @@ -617,11 +620,12 @@ rec, query->sa_query.mad->data); ret = send_mad(&query->sa_query, timeout_ms); - if (ret) + if (ret) { + kfree(query->sa_query.mad); kfree(query); + } else + *sa_query = &query->sa_query; - *sa_query = &query->sa_query; - return ret ? ret : query->sa_query.id; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); From roland at topspin.com Fri Nov 5 19:10:11 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 19:10:11 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: (Krishna Kumar's message of "Fri, 5 Nov 2004 14:01:22 -0800 (PST)") References: Message-ID: <527jozhdv0.fsf@topspin.com> Thanks but I'm not going to apply this. I prefer to have the locking and the idr lookup be explicit (and it's only done in two places so the cleanup is pretty minimal). Thanks, Roland From roland at topspin.com Fri Nov 5 19:10:55 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 05 Nov 2004 19:10:55 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <418C0941.2080904@ichips.intel.com> (Sean Hefty's message of "Fri, 05 Nov 2004 15:14:09 -0800") References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> Message-ID: <523bznhdts.fsf@topspin.com> Sean> Hal, can you check that your code stays within 80 characters Sean> per line? The 80 character limit is really just a guideline. It's not worth going through contortions to fix an 85-character line. - Roland From mshefty at ichips.intel.com Fri Nov 5 19:56:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 05 Nov 2004 19:56:27 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <523bznhdts.fsf@topspin.com> References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> <523bznhdts.fsf@topspin.com> Message-ID: <418C4B6B.1030108@ichips.intel.com> Roland Dreier wrote: > Sean> Hal, can you check that your code stays within 80 characters > Sean> per line? > > The 80 character limit is really just a guideline. It's not worth > going through contortions to fix an 85-character line. Okay. I was just going by the coding style documentation that mentioned that this was a "hard limit". If it's not that big of a deal, then I'll only worry about excessively long lines. 
- Sean

From roland at topspin.com  Fri Nov  5 21:23:13 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 05 Nov 2004 21:23:13 -0800
Subject: Re: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad
In-Reply-To: <418C4B6B.1030108@ichips.intel.com> (Sean Hefty's message of "Fri, 05 Nov 2004 19:56:27 -0800")
References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> <523bznhdts.fsf@topspin.com> <418C4B6B.1030108@ichips.intel.com>
Message-ID: <52y8hfft4u.fsf@topspin.com>

Sean> Okay. I was just going by the coding style documentation
Sean> that mentioned that this was a "hard limit". If it's not
Sean> that big of a deal, then I'll only worry about excessively
Sean> long lines.

Yeah, if you read through the kernel source, you can find tons and
tons of lines somewhat longer than 80 characters. In fact just now I
was noticing gems like the 125-character line

    struct class_device *class_simple_device_add(struct class_simple *cs, dev_t dev, struct device *device, const char *fmt, ...)

in drivers/base/class_simple.c... maybe a better example of the right
way to do things is a line like

    if (tp->link_config.advertising & ADVERTISED_1000baseT_Half)

from drivers/net/tg3.c, which ends in column 83 but looks fine.

- R.

From roland at topspin.com  Fri Nov  5 21:39:38 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 05 Nov 2004 21:39:38 -0800
Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct
Message-ID: <52u0s3fsdh.fsf@topspin.com>

It seems that there are lots of places where consumers need to
allocate an entire ib_device_attr struct and deal with the possibility
that ib_query_device() might fail, just to find out how many ports a
device has.

We discussed this before and concluded that it was OK to assume that
the number of physical ports is constant. This patch simplifies a lot
of code by making phys_port_cnt a field in struct ib_device, so
consumers can just read the value when they need it.

Does this look good to commit?
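(The consumer-side effect in miniature -- an illustrative sketch, not a
hunk from the patch itself:)

int num_ports;

/* before: an allocation and a failure path just to learn a constant */
struct ib_device_attr attr;

if (ib_query_device(device, &attr))
	return;
num_ports = attr.phys_port_cnt;

/* after: just read the field */
num_ports = device->phys_port_cnt;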
Thanks, Roland Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1167) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -821,19 +821,12 @@ static void ipoib_add_one(struct ib_device *device) { - struct ib_device_attr props; int port; - if (ib_query_device(device, &props)) { - printk(KERN_WARNING "%s: ib_device_properties_get failed\n", - device->name); - return; - } - if (device->node_type == IB_NODE_SWITCH) ipoib_add_port("ib%d", device, 0); else - for (port = 1; port <= props.phys_port_cnt; ++port) + for (port = 1; port <= device->phys_port_cnt; ++port) ipoib_add_port("ib%d", device, port); } Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1167) +++ infiniband/include/ib_verbs.h (working copy) @@ -91,7 +91,6 @@ int max_cqe; int max_mr; int max_pd; - int phys_port_cnt; int max_qp_rd_atom; int max_ee_rd_atom; int max_res_rd_atom; @@ -794,6 +793,7 @@ } reg_state; u8 node_type; + u8 phys_port_cnt; }; struct ib_client { Index: infiniband/core/device.c =================================================================== --- infiniband/core/device.c (revision 1167) +++ infiniband/core/device.c (working copy) @@ -191,7 +191,6 @@ int ib_register_device(struct ib_device *device) { struct ib_device_private *priv; - struct ib_device_attr prop; int ret; down(&device_sem); @@ -217,18 +216,11 @@ *priv = (struct ib_device_private) { 0 }; - ret = device->query_device(device, &prop); - if (ret) { - printk(KERN_WARNING "query_device failed for %s\n", - device->name); - goto out_free; - } - if (device->node_type == IB_NODE_SWITCH) { priv->start_port = priv->end_port = 0; } else { priv->start_port = 1; - priv->end_port = prop.phys_port_cnt; + priv->end_port = device->phys_port_cnt; } priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), @@ -236,6 +228,7 @@ if (!priv->port_data) { printk(KERN_WARNING "Couldn't allocate port info for %s\n", device->name); + ret = -ENOMEM; goto out_free; } @@ -253,7 +246,8 @@ goto out_free_port; } - if (ib_device_register_sysfs(device)) { + ret = ib_device_register_sysfs(device); + if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); goto out_free_cache; Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1167) +++ infiniband/core/user_mad.c (working copy) @@ -489,12 +489,8 @@ if (device->node_type == IB_NODE_SWITCH) s = e = 0; else { - struct ib_device_attr attr; - if (ib_query_device(device, &attr)) - return; - s = 1; - e = attr.phys_port_cnt; + e = device->phys_port_cnt; } umad_dev = kmalloc(sizeof *umad_dev + Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 1167) +++ infiniband/core/mad.c (working copy) @@ -244,7 +244,6 @@ mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; - mad_agent_priv->phys_port_cnt = port_priv->phys_port_cnt; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -418,7 +417,7 @@ if (!smi_handle_dr_smp_recv((struct ib_smp *)&mad_priv->mad, mad_agent->device->node_type, mad_agent->port_num, - mad_agent_priv->phys_port_cnt)) { + 
mad_agent->device->phys_port_cnt)) { ret = -EINVAL; kmem_cache_free(ib_mad_cache, mad_priv); goto error1; @@ -1085,7 +1084,7 @@ if (!smi_handle_dr_smp_recv(smp, port_priv->device->node_type, port_priv->port_num, - port_priv->phys_port_cnt)) + port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) goto out; @@ -1125,7 +1124,7 @@ (struct ib_smp *)response, port_priv->device->node_type, port_priv->port_num, - port_priv->phys_port_cnt)) { + port_priv->device->phys_port_cnt)) { kfree(response); goto out; } @@ -1842,8 +1841,7 @@ * Create the QP, PD, MR, and CQ if needed */ static int ib_mad_port_open(struct ib_device *device, - int port_num, - int num_ports) + int port_num) { int ret, cq_size; u64 iova = 0; @@ -1872,7 +1870,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; - port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1985,29 +1982,22 @@ static void ib_mad_init_device(struct ib_device *device) { int ret, num_ports, cur_port, i, ret2; - struct ib_device_attr device_attr; - ret = ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port, num_ports); + ret = ib_mad_port_open(device, cur_port); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); goto error_device_open; } - ret = ib_agent_port_open(device, cur_port, num_ports); + ret = ib_agent_port_open(device, cur_port); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d for agents\n", device->name, cur_port); @@ -2039,20 +2029,13 @@ static void ib_mad_remove_device(struct ib_device *device) { - int ret, i, num_ports, cur_port, ret2; - struct ib_device_attr device_attr; + int ret = 0, i, num_ports, cur_port, ret2; - ret = ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { @@ -2071,9 +2054,6 @@ ret = ret2; } } - -error_device_query: - return; } static struct ib_client mad_client = { Index: infiniband/core/agent.h =================================================================== --- infiniband/core/agent.h (revision 1167) +++ infiniband/core/agent.h (working copy) @@ -27,8 +27,7 @@ #define __AGENT_H_ extern int ib_agent_port_open(struct ib_device *device, - int port_num, - int phys_port_cnt); + int port_num); extern int ib_agent_port_close(struct ib_device *device, int port_num); Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 1167) +++ infiniband/core/sysfs.c (working copy) @@ -640,14 +640,9 @@ if (ret) goto err_put; } else { - struct ib_device_attr attr; int i; - ret = ib_query_device(device, &attr); - if (ret) - goto err_put; - - for (i = 1; i <= attr.phys_port_cnt; ++i) { + for (i = 1; i <= device->phys_port_cnt; ++i) { ret = add_port(device, i); if (ret) goto err_put; Index: 
infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1167) +++ infiniband/core/sa_query.c (working copy) @@ -696,12 +696,8 @@ if (device->node_type == IB_NODE_SWITCH) s = e = 0; else { - struct ib_device_attr attr; - if (ib_query_device(device, &attr)) - return; - s = 1; - e = attr.phys_port_cnt; + e = device->phys_port_cnt; } sa_dev = kmalloc(sizeof *sa_dev + Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1167) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -47,7 +47,6 @@ if (!in_mad || !out_mad) goto out; - props->phys_port_cnt = to_mdev(ibdev)->limits.num_ports; props->fw_ver = to_mdev(ibdev)->fw_ver; memset(in_mad, 0, sizeof *in_mad); @@ -573,6 +572,7 @@ strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = dev->pdev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; From halr at voltaire.com Fri Nov 5 21:51:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 00:51:35 -0500 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <52u0s3fsdh.fsf@topspin.com> References: <52u0s3fsdh.fsf@topspin.com> Message-ID: <1099720295.3278.80.camel@localhost.localdomain> On Sat, 2004-11-06 at 00:39, Roland Dreier wrote: > It seems that there are lots of places where consumers need to > allocate an entire ib_device_attr struct and deal with the possibility > that ib_query_device() might fail, just to find out how many ports a > device has. Yes, I noticed this when I added yet another case of this earlier today/yesterday. You beat me to this :-) > We discussed this before and concluded that it was OK to > assume that the number of physical ports is constant. Agreed. > This patch simplifies a lot of code by making phys_port_cnt a field in > struct ib_device, so consumers can just read the value when they need it. > > Does this look good to commit? Looks good. Do you want me to try it (tomorrow/today depending on your time zone before committing it) ? 
-- Hal

From halr at voltaire.com  Fri Nov  5 22:04:50 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 06 Nov 2004 01:04:50 -0500
Subject: [openib-general] [PATCH] [TRIVIAL] Remove unused variable from ipoib_multicast.c
Message-ID: <1099721089.14986.1.camel@hpc-1>

Remove unused variable from ipoib_multicast.c

Index: ipoib_multicast.c
===================================================================
--- ipoib_multicast.c	(revision 1167)
+++ ipoib_multicast.c	(working copy)
@@ -259,7 +259,6 @@
 {
 	struct ipoib_mcast *mcast = mcast_ptr;
 	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	if (!status)
 		ipoib_mcast_join_finish(mcast, mcmember);

From halr at voltaire.com  Sat Nov  6 04:00:15 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 06 Nov 2004 07:00:15 -0500
Subject: [openib-general] [PATCH] [TRIVIAL] Remove unused variable (not debug) from ipoib_multicast.c
Message-ID: <1099742414.17534.1.camel@hpc-1>

Remove unused variable (not debug) from ipoib_multicast.c
(This supersedes the previous version)

Index: ipoib_multicast.c
===================================================================
--- ipoib_multicast.c	(revision 1167)
+++ ipoib_multicast.c	(working copy)
@@ -259,7 +259,9 @@
 {
 	struct ipoib_mcast *mcast = mcast_ptr;
 	struct net_device *dev = mcast->dev;
+#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+#endif
 
 	if (!status)
 		ipoib_mcast_join_finish(mcast, mcmember);

From roland at topspin.com  Sat Nov  6 08:43:40 2004
From: roland at topspin.com (Roland Dreier)
Date: Sat, 06 Nov 2004 08:43:40 -0800
Subject: Re: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct
In-Reply-To: <1099720295.3278.80.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 06 Nov 2004 00:51:35 -0500")
References: <52u0s3fsdh.fsf@topspin.com> <1099720295.3278.80.camel@localhost.localdomain>
Message-ID: <52lldfexmr.fsf@topspin.com>

Hal> Looks good. Do you want me to try it (tomorrow/today
Hal> depending on your time zone before committing it) ?

Sure, I'm happy to wait.

- Roland

From roland at topspin.com  Sat Nov  6 08:44:56 2004
From: roland at topspin.com (Roland Dreier)
Date: Sat, 06 Nov 2004 08:44:56 -0800
Subject: Re: [openib-general] [PATCH] [TRIVIAL] Remove unused variable (not debug) from ipoib_multicast.c
In-Reply-To: <1099742414.17534.1.camel@hpc-1> (Hal Rosenstock's message of "Sat, 06 Nov 2004 07:00:15 -0500")
References: <1099742414.17534.1.camel@hpc-1>
Message-ID: <52hdo3exkn.fsf@topspin.com>

Hal> Remove unused variable (not debug) from ipoib_multicast.c

Thanks for pointing this out. I fixed it like this rather than adding
another #ifdef...
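(For context: the variable is only unused in non-debug builds because the
mcast debug macro then compiles away and nothing reads 'priv'. Roughly
the following -- a sketch of the ipoib.h arrangement, not a verbatim
quote:)

#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
#define ipoib_dbg_mcast(priv, format, arg...) \
	ipoib_printk(KERN_DEBUG, priv, format , ## arg)
#else
#define ipoib_dbg_mcast(priv, format, arg...) \
	do { } while (0)	/* 'priv' never referenced */
#endif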
Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1167) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -259,14 +259,13 @@ { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) ipoib_mcast_join_finish(mcast, mcmember); else { if (mcast->logcount++ < 20) - ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT - ", status %d\n", + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1167) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -228,7 +228,7 @@ #define ipoib_printk(level, priv, format, arg...) \ - printk(level "%s: " format, (priv)->dev->name , ## arg) + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) #define ipoib_warn(priv, format, arg...) \ ipoib_printk(KERN_WARNING, priv, format , ## arg) Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1167) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -461,12 +461,8 @@ /*..ipoib_ib_dev_cleanup -- clean up IB resources for iface */ void ipoib_ib_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = netdev_priv(dev); + ipoib_dbg(netdev_priv(dev), "cleaning up ib_dev\n"); - /* Avoid unused warning if DEBUG is off */ - (void) priv; - ipoib_dbg(priv, "cleaning up ib_dev\n"); - ipoib_mcast_stop_thread(dev); /* Delete the broadcast address and the local address */ From halr at voltaire.com Sat Nov 6 10:14:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:14:29 -0500 Subject: [openib-general] IPoIB Multicast Message-ID: <1099764868.20222.8.camel@hpc-1> Hi Roland, The IB multicast support appears to be much better :-) I have not been able to recreate (at least as yet) any of the ifdown or modprobe -r issues I used to see. I will keep an eye on this and report back if this changes. I have found two minor anomalies/questions which do not cause any operational issues: 1. If you down the interface and bring it back up, the second time up, there are 2 identical join requests for the broadcast group rather than just 1. These 2 come out very close to one another (217 usec apart). Is there some counting issue that is causing this ? 2. When leaving an IP multicast group, there appears to be an extra join to 0x16 (something like 224.0.0.22 which would be for IGMP). Any ideas on this ? Thanks. -- Hal From halr at voltaire.com Sat Nov 6 10:34:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:34:57 -0500 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <52lldfexmr.fsf@topspin.com> References: <52u0s3fsdh.fsf@topspin.com> <1099720295.3278.80.camel@localhost.localdomain> <52lldfexmr.fsf@topspin.com> Message-ID: <1099766097.20222.23.camel@hpc-1> On Sat, 2004-11-06 at 11:43, Roland Dreier wrote: > Hal> Looks good. Do you want me to try it (tomorrow/today > Hal> depending on your time zone before committing it) ? > > Sure, I'm happy to wait. Works for me. 
-- Hal From halr at voltaire.com Sat Nov 6 10:42:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:42:20 -0500 Subject: [openib-general] Latest IPoIB Bringup Questions In-Reply-To: <52654u1vwg.fsf@topspin.com> References: <1098985903.17991.74.camel@hpc-1> <52654u1vwg.fsf@topspin.com> Message-ID: <1099766540.20222.29.camel@hpc-1> On Thu, 2004-10-28 at 15:32, Roland Dreier wrote: > Probably better to work on ip, since ifconfig has other issues (such > as using an ioctl limited to 14 bytes to get the HW addr) I presume this is the same issue for arp (e.g. arp -a). So how do we go about getting this increased in the 2.6 kernel ? Is 20 bytes sufficient ? Should this be part of our 2.6 diffs as well ? -- Hal From halr at voltaire.com Sat Nov 6 10:49:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 13:49:17 -0500 Subject: [Fwd: [openib-general] ifconfig ib0 down and then up vis a vis IP connectivity] Message-ID: <1099766957.20222.34.camel@hpc-1> I have confirmed that this is an ARP cache issue on the remote machine. The remote node is responding to the old DQPN of the machine whose IPoIB interface was downed and then brought up. When it is brought up again, it has a different QPN. The remote node still has the old QPN cached until it times out, and it sends to the old QPN, which is discarded on the local node. It behaves just like a hardware address change for an IP address with which a remote node had previously communicated (and whose MAC address it had cached). -- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] ifconfig ib0 down and then up vis a vis IP connectivity Date: 02 Nov 2004 14:45:03 -0500 Hi, What is the ARP timeout in Linux ? If I down and then up the ib0 interface, there is some delay before connectivity is restored despite the fact that it is successfully (re)attached to the multicast groups and that all the QPNs seem to be the same. After some time period, connectivity is restored. Any idea on what is different ? It seems like it is an ARP cache issue. Thanks. -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Sat Nov 6 11:26:30 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 06 Nov 2004 11:26:30 -0800 Subject: [openib-general] Latest IPoIB Bringup Questions In-Reply-To: <1099766540.20222.29.camel@hpc-1> (Hal Rosenstock's message of "Sat, 06 Nov 2004 13:42:20 -0500") References: <1098985903.17991.74.camel@hpc-1> <52654u1vwg.fsf@topspin.com> <1099766540.20222.29.camel@hpc-1> Message-ID: <52d5yqg4nt.fsf@topspin.com> Hal> So how do we go about getting this increased in the 2.6 Hal> kernel ? Is 20 bytes sufficient ? Should this be part of our Hal> 2.6 diffs as well ? The kernel has no problem (I had MAX_ADDR_LEN increased to 32 about 2 years ago). Just use the ip tool instead of ifconfig and arp (e.g. "ip neigh" or "ip addr"). - R. From roland at topspin.com Sat Nov 6 11:27:53 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 06 Nov 2004 11:27:53 -0800 Subject: [openib-general] IPoIB Multicast In-Reply-To: <1099764868.20222.8.camel@hpc-1> (Hal Rosenstock's message of "Sat, 06 Nov 2004 13:14:29 -0500") References: <1099764868.20222.8.camel@hpc-1> Message-ID: <528y9eg4li.fsf@topspin.com> Hal> 1. 
If you down the interface and bring it back up, the second Hal> time up, there are 2 identical join requests for the Hal> broadcast group rather than just 1. These 2 come out very Hal> close to one another (217 usec apart). Is there some counting Hal> issue that is causing this ? Hal> 2. When leaving an IP multicast group, there appears to be an Hal> extra join to 0x16 (something like 224.0.0.22 which would be Hal> for IGMP). Any ideas on this ? If you or someone else doesn't debug these issues first, I'll take a look at the code. - R. From halr at voltaire.com Sat Nov 6 11:48:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 06 Nov 2004 14:48:38 -0500 Subject: [openib-general] IPoIB Multicast In-Reply-To: <528y9eg4li.fsf@topspin.com> References: <1099764868.20222.8.camel@hpc-1> <528y9eg4li.fsf@topspin.com> Message-ID: <1099770518.3281.3.camel@localhost.localdomain> On Sat, 2004-11-06 at 14:27, Roland Dreier wrote: > Hal> 1. If you down the interface and bring it back up, the second > Hal> time up, there are 2 identical join requests for the > Hal> broadcast group rather than just 1. These 2 come out very > Hal> close to one another (217 usec apart). Is there some counting > Hal> issue that is causing this ? > > Hal> 2. When leaving an IP multicast group, there appears to be an > Hal> extra join to 0x16 (something like 224.0.0.22 which would be > Hal> for IGMP). Any ideas on this ? > > If you or someone else doesn't debug these issues first, I'll take a > look at the code. I'll take a first crack and look at the code to see what I can determine. On the second issue, I partially understand what is going on: IPmc group changes need to be reported via IGMP so the IPmc router knows to prune the multicast tree, but... first, I don't understand why it joins here (and not earlier, when an IPmc group is first joined by this node); and second, after the join is successful, I do not see any IGMP packet come out of the node (onto IB; maybe it is going out the ethernet instead). -- Hal From mshefty at ichips.intel.com Mon Nov 8 08:48:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 08:48:43 -0800 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <52u0s3fsdh.fsf@topspin.com> References: <52u0s3fsdh.fsf@topspin.com> Message-ID: <418FA36B.8010402@ichips.intel.com> Roland Dreier wrote: > This patch simplifies a lot of code by making phys_port_cnt a field in > struct ib_device, so consumers can just read the value when they need it. > > Does this look good to commit? Looks good to me. - Sean From iod00d at hp.com Mon Nov 8 08:55:55 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 8 Nov 2004 08:55:55 -0800 Subject: [openib-general] [PATCH] mad: Handle outgoing SMPs in ib_post_send_mad In-Reply-To: <418C4B6B.1030108@ichips.intel.com> References: <1099680579.2965.7.camel@hpc-1> <418C0941.2080904@ichips.intel.com> <523bznhdts.fsf@topspin.com> <418C4B6B.1030108@ichips.intel.com> Message-ID: <20041108165555.GD14706@cup.hp.com> On Fri, Nov 05, 2004 at 07:56:27PM -0800, Sean Hefty wrote: > Roland Dreier wrote: > >The 80 character limit is really just a guideline. It's not worth > >going through contortions to fix an 85-character line. > > Okay. I was just going by the coding style documentation that mentioned > that this was a "hard limit". If it's not that big of a deal, then I'll > only worry about excessively long lines. I don't take it as a hard limit either. But I rarely write code that exceeds 80 columns. 
And I expect someone will complain when gen2 is submitted to LKML if more than a few lines are longer than 80 columns. grant From roland at topspin.com Mon Nov 8 09:10:31 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 09:10:31 -0800 Subject: [openib-general] [PATCH][RFC] Put phys_port_cnt in device struct In-Reply-To: <418FA36B.8010402@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Nov 2004 08:48:43 -0800") References: <52u0s3fsdh.fsf@topspin.com> <418FA36B.8010402@ichips.intel.com> Message-ID: <523bzke06w.fsf@topspin.com> Cool, I've committed this. - R. From krkumar at us.ibm.com Mon Nov 8 10:45:08 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 10:45:08 -0800 (PST) Subject: [openib-general] [PATCH] Extra kfrees, clean up unregisters, etc ... In-Reply-To: <52fz3nhf1n.fsf@topspin.com> Message-ID: Yes, looks good. - KK On Fri, 5 Nov 2004, Roland Dreier wrote: > Thanks for the audit. I applied this version of your patch. Does > this still look correct? > > Index: infiniband/core/sa_query.c > =================================================================== > --- infiniband/core/sa_query.c (revision 1164) > +++ infiniband/core/sa_query.c (working copy) > @@ -699,29 +710,28 @@ > sa_dev->start_port = s; > sa_dev->end_port = e; > > - for (i = s; i <= e; ++i) { > - sa_dev->port[i - s].mr = NULL; > - sa_dev->port[i - s].sm_ah = NULL; > - sa_dev->port[i - s].port_num = i; > - spin_lock_init(&sa_dev->port[i - s].ah_lock); > + for (i = 0; i <= e - s; ++i) { > + sa_dev->port[i].mr = NULL; > + sa_dev->port[i].sm_ah = NULL; > + sa_dev->port[i].port_num = i + s; > + spin_lock_init(&sa_dev->port[i].ah_lock); > > - sa_dev->port[i - s].agent = > - ib_register_mad_agent(device, i, IB_QPT_GSI, > + sa_dev->port[i].agent = > + ib_register_mad_agent(device, i + s, IB_QPT_GSI, > NULL, 0, send_handler, > recv_handler, sa_dev); > - if (IS_ERR(sa_dev->port[i - s].agent)) > + if (IS_ERR(sa_dev->port[i].agent)) > goto err; > > - sa_dev->port[i - s].mr = ib_get_dma_mr(sa_dev->port[i - s].agent->qp->pd, > - IB_ACCESS_LOCAL_WRITE); > - if (IS_ERR(sa_dev->port[i - s].mr)) { > - /* Bump i so agent from this iter. 
is freed */ > - ++i; > + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, > + IB_ACCESS_LOCAL_WRITE); > + if (IS_ERR(sa_dev->port[i].mr)) { > + ib_unregister_mad_agent(sa_dev->port[i].agent); > goto err; > } > > - INIT_WORK(&sa_dev->port[i - s].update_task, > - update_sm_ah, &sa_dev->port[i - s]); > + INIT_WORK(&sa_dev->port[i].update_task, > + update_sm_ah, &sa_dev->port[i]); > } > > /* > @@ -732,27 +742,20 @@ > */ > > INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); > - if (ib_register_event_handler(&sa_dev->event_handler)) { > - kfree(sa_dev); > + if (ib_register_event_handler(&sa_dev->event_handler)) > goto err; > - } > > - for (i = s; i <= e; ++i) > - update_sm_ah(&sa_dev->port[i - s]); > + for (i = 0; i <= e - s; ++i) > + update_sm_ah(&sa_dev->port[i]); > > ib_set_client_data(device, &sa_client, sa_dev); > > return; > > err: > - while (--i >= s) { > - if (sa_dev->port[i - s].mr && !IS_ERR(sa_dev->port[i - s].mr)) > - ib_dereg_mr(sa_dev->port[i - s].mr); > - > - if (sa_dev->port[i - s].sm_ah) > - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); > - > - ib_unregister_mad_agent(sa_dev->port[i - s].agent); > + while (--i >= 0) { > + ib_dereg_mr(sa_dev->port[i].mr); > + ib_unregister_mad_agent(sa_dev->port[i].agent); > } > > kfree(sa_dev); > > From roland at topspin.com Mon Nov 8 10:55:22 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 10:55:22 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr Message-ID: <52654gcgrp.fsf@topspin.com> Convert mad.c and agent.c to use ib_get_dma_mr() instead of ib_reg_phys_mr(). This is simpler and is actually required on platforms such as sparc64 where DMA addresses may not match up with physical RAM addresses. OK to commit? - Roland Index: core/agent.c =================================================================== --- core/agent.c (revision 1172) +++ core/agent.c (working copy) @@ -281,15 +281,9 @@ kfree(agent_send_wr); } -int ib_agent_port_open(struct ib_device *device, int port_num, - int phys_port_cnt) +int ib_agent_port_open(struct ib_device *device, int port_num) { int ret; - u64 iova = 0; - struct ib_phys_buf buf_list = { - .addr = 0, - .size = (unsigned long) high_memory - PAGE_OFFSET - }; struct ib_agent_port_private *port_priv; struct ib_mad_reg_req reg_req; unsigned long flags; @@ -312,7 +306,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->port_num = port_num; - port_priv->phys_port_cnt = phys_port_cnt; port_priv->wr_id = 0; spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->send_posted_list); @@ -356,9 +349,8 @@ goto error4; } - port_priv->mr = ib_reg_phys_mr(port_priv->dr_smp_agent->qp->pd, - &buf_list, 1, - IB_ACCESS_LOCAL_WRITE, &iova); + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); if (IS_ERR(port_priv->mr)) { printk(KERN_ERR SPFX "Couldn't register MR\n"); ret = PTR_ERR(port_priv->mr); Index: core/mad.c =================================================================== --- core/mad.c (revision 1172) +++ core/mad.c (working copy) @@ -1844,11 +1844,6 @@ int port_num) { int ret, cq_size; - u64 iova = 0; - struct ib_phys_buf buf_list = { - .addr = 0, - .size = (unsigned long) high_memory - PAGE_OFFSET - }; struct ib_mad_port_private *port_priv; unsigned long flags; @@ -1890,8 +1885,7 @@ goto error4; } - port_priv->mr = ib_reg_phys_mr(port_priv->pd, &buf_list, 1, - IB_ACCESS_LOCAL_WRITE, &iova); + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); if 
(IS_ERR(port_priv->mr)) { printk(KERN_ERR PFX "Couldn't register ib_mad MR\n"); ret = PTR_ERR(port_priv->mr); From tduffy at sun.com Mon Nov 8 10:57:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 08 Nov 2004 10:57:23 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <52654gcgrp.fsf@topspin.com> References: <52654gcgrp.fsf@topspin.com> Message-ID: <1099940243.2274.8.camel@duffman> On Mon, 2004-11-08 at 10:55 -0800, Roland Dreier wrote: > Convert mad.c and agent.c to use ib_get_dma_mr() instead of > ib_reg_phys_mr(). This is simpler and is actually required on > platforms such as sparc64 where DMA addresses may not match up with > physical RAM addresses. > > OK to commit? Yes Yes please. -tduffy -- Tom Duffy From krkumar at us.ibm.com Mon Nov 8 10:50:44 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 10:50:44 -0800 (PST) Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: <527jozhdv0.fsf@topspin.com> Message-ID: On Fri, 5 Nov 2004, Roland Dreier wrote: > Thanks but I'm not going to apply this. I prefer to have the locking > and the idr lookup be explicit (and it's only done in two places so > the cleanup is pretty minimal). Actually three places ... And IMO, it does make the locking code look cleaner, eg, the original code (with multiple unlocks) : void ib_sa_cancel_query(int id, struct ib_sa_query *query) { unsigned long flags; spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, query->id) != query) { spin_unlock_irqrestore(&idr_lock, flags); return; } spin_unlock_irqrestore(&idr_lock, flags); ib_cancel_mad(query->port->agent, query->id); } now becomes : void ib_sa_cancel_query(int id, struct ib_sa_query *query) { if (ib_sa_find_idr(id) == query) ib_cancel_mad(query->port->agent, query->id); } with the find: static inline struct ib_sa_query *ib_sa_find_idr(int id) { struct ib_sa_query *query; unsigned long flags; spin_lock_irqsave(&idr_lock, flags); query = idr_find(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); return query; } thx, - KK From mshefty at ichips.intel.com Mon Nov 8 11:01:42 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 11:01:42 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <52654gcgrp.fsf@topspin.com> References: <52654gcgrp.fsf@topspin.com> Message-ID: <418FC296.7070602@ichips.intel.com> Roland Dreier wrote: > Convert mad.c and agent.c to use ib_get_dma_mr() instead of > ib_reg_phys_mr(). This is simpler and is actually required on > platforms such as sparc64 where DMA addresses may not match up with > physical RAM addresses. > > OK to commit? Looks good to me. - Sean From mshefty at ichips.intel.com Mon Nov 8 11:07:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 11:07:54 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: References: Message-ID: <418FC40A.9080403@ichips.intel.com> Krishna Kumar wrote: > Actually three places ... 
And IMO, it does make the locking code look > cleaner, eg, the original code (with multiple unlocks) : > > void ib_sa_cancel_query(int id, struct ib_sa_query *query) > { > unsigned long flags; > > spin_lock_irqsave(&idr_lock, flags); > if (idr_find(&query_idr, query->id) != query) { > spin_unlock_irqrestore(&idr_lock, flags); > return; > } > spin_unlock_irqrestore(&idr_lock, flags); > > ib_cancel_mad(query->port->agent, query->id); I admit that I haven't looked at the SA code yet, but can ib_sa_cancel_query pass straight through to ib_cancel_mad? Since the lock is not held around both the find and the cancel, it seems possible. - Sean From krkumar at us.ibm.com Mon Nov 8 11:03:18 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 11:03:18 -0800 (PST) Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: <52brebhdwh.fsf@topspin.com> Message-ID: Hi Roland, I agree with this. BTW, can't the release handler execute before the (I know, quirky race, but interrupts ...) : } else *sa_query = &query->sa_query and free up the memory ? Do you want send_mad() to return a copy of the sa_query (*copy = *query) before it actually sends on the wire ? The callers can use this on success case to return sa_query and id. - KK On Fri, 5 Nov 2004, Roland Dreier wrote: > Sorry, this and the follow-up patch are wrong. If the send > succeeds then we can't free the query structure until the query > finishes up. (The query will be freed in the appropriate ->release > method in this case). > > You are right that there is a memory leak though. I fixed it like > this: > > Index: infiniband/core/sa_query.c > =================================================================== > --- infiniband/core/sa_query.c (revision 1166) > +++ infiniband/core/sa_query.c (working copy) > @@ -500,6 +500,7 @@ > > static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) > { > + kfree(sa_query->mad); > kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); > } > > @@ -544,11 +545,12 @@ > rec, query->sa_query.mad->data); > > ret = send_mad(&query->sa_query, timeout_ms); > - if (ret) > + if (ret) { > + kfree(query->sa_query.mad); > kfree(query); > + } else > + *sa_query = &query->sa_query; > > - *sa_query = &query->sa_query; > - > return ret ? ret : query->sa_query.id; > } > EXPORT_SYMBOL(ib_sa_path_rec_get); > @@ -572,6 +574,7 @@ > > static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) > { > + kfree(sa_query->mad); > kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); > } > > @@ -617,11 +620,12 @@ > rec, query->sa_query.mad->data); > > ret = send_mad(&query->sa_query, timeout_ms); > - if (ret) > + if (ret) { > + kfree(query->sa_query.mad); > kfree(query); > + } else > + *sa_query = &query->sa_query; > > - *sa_query = &query->sa_query; > - > return ret ? 
ret : query->sa_query.id; > } > EXPORT_SYMBOL(ib_sa_mcmember_rec_query); > > From roland at topspin.com Mon Nov 8 11:12:03 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:12:03 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: <418FC40A.9080403@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Nov 2004 11:07:54 -0800") References: <418FC40A.9080403@ichips.intel.com> Message-ID: <521xf4cfzw.fsf@topspin.com> Actually looking at this code one more time: spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, query->id) != query) { spin_unlock_irqrestore(&idr_lock, flags); return; } spin_unlock_irqrestore(&idr_lock, flags); ib_cancel_mad(query->port->agent, query->id); I realize that it has a race. I check that the query is still around inside the spinlock, but the query could complete and be freed in between the unlock and the call to ib_cancel_mad(). I'll have to add some reference counting... - R. From halr at voltaire.com Mon Nov 8 11:19:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 14:19:53 -0500 Subject: [openib-general] [PATCH] mad: Eliminate line wraps in mad.c and agent.c Message-ID: <1099941592.25460.4.camel@hpc-1> mad: Eliminate line wraps (lines over 80 columns) in mad.c and agent.c Index: mad.c =================================================================== --- mad.c (revision 1168) +++ mad.c (working copy) @@ -244,7 +244,6 @@ mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; - mad_agent_priv->phys_port_cnt = port_priv->phys_port_cnt; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -400,25 +399,28 @@ GFP_ATOMIC : GFP_KERNEL); if (!mad_priv) { ret = -ENOMEM; - printk(KERN_ERR PFX "No memory for local response MAD\n"); + printk(KERN_ERR PFX "No memory for local " + "response MAD\n"); goto error1; } mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - ret = mad_agent->device->process_mad(mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ - (struct ib_mad *)smp, - (struct ib_mad *)&mad_priv->mad); + ret = mad_agent->device->process_mad( + mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? 
*/ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); if ((ret & IB_MAD_RESULT_SUCCESS) && (ret & IB_MAD_RESULT_REPLY)) { - if (!smi_handle_dr_smp_recv((struct ib_smp *)&mad_priv->mad, - mad_agent->device->node_type, - mad_agent->port_num, - mad_agent_priv->phys_port_cnt)) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)&mad_priv->mad, + mad_agent->device->node_type, + mad_agent->port_num, + mad_agent->device->phys_port_cnt)) { ret = -EINVAL; kmem_cache_free(ib_mad_cache, mad_priv); goto error1; @@ -430,7 +432,10 @@ mad_agent_priv->agent.recv_handler) { struct ib_wc wc; - /* Defined behavior is to complete response before request */ + /* + * Defined behavior is to complete response + * before request + */ wc.wr_id = send_wr->wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; @@ -443,13 +448,16 @@ wc.sl = 0; wc.dlid_path_bits = 0; mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); mad_priv->header.recv_buf.grh = NULL; mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler(mad_agent, - &mad_priv->header.recv_wc); + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler( + mad_agent, + &mad_priv->header.recv_wc); } else kmem_cache_free(ib_mad_cache, mad_priv); @@ -458,7 +466,9 @@ mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; mad_send_wc.wr_id = send_wr->wr_id; - mad_agent_priv->agent.send_handler(mad_agent, &mad_send_wc); + mad_agent_priv->agent.send_handler( + mad_agent, + &mad_send_wc); ret = 1; } else ret = -EINVAL; @@ -515,7 +525,8 @@ (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) goto error2; - mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, agent); /* Walk list of send WRs and post each on send list */ @@ -527,7 +538,8 @@ if (!cur_send_wr->wr.ud.mad_hdr) { *bad_send_wr = cur_send_wr; - printk(KERN_ERR PFX "MAD header must be supplied in WR %p\n", cur_send_wr); + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", cur_send_wr); goto error1; } @@ -609,7 +621,8 @@ struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *priv; - mad_priv_hdr = container_of(mad_recv_wc, struct ib_mad_private_header, + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); @@ -678,7 +691,8 @@ /* Allocate management method table */ *method = kmalloc(sizeof **method, GFP_ATOMIC); if (!*method) { - printk(KERN_ERR PFX "No memory for ib_mad_mgmt_method_table\n"); + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); return -ENOMEM; } /* Clear management method table */ @@ -773,7 +787,8 @@ goto error3; /* Finally, add in methods being registered */ - for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); i < IB_MGMT_MAX_METHODS; i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { @@ -806,7 +821,10 @@ struct ib_mad_mgmt_method_table *method; u8 mgmt_class; - /* Was MAD registration request supplied with original registration ? */ + /* + * Was MAD registration request supplied + * with original registration ? 
+ */ if (!agent_priv->reg_req) { goto out; } @@ -1085,7 +1103,7 @@ if (!smi_handle_dr_smp_recv(smp, port_priv->device->node_type, port_priv->port_num, - port_priv->phys_port_cnt)) + port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) goto out; @@ -1108,7 +1126,10 @@ response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); if (!response) { printk(KERN_ERR PFX "No memory for response MAD\n"); - /* Is it better to assume that it wouldn't be processed ? */ + /* + * Is it better to assume that + * it wouldn't be processed ? + */ goto out; } @@ -1122,16 +1143,17 @@ if (response->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, - port_priv->device->node_type, - port_priv->port_num, - port_priv->phys_port_cnt)) { + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) { kfree(response); goto out; } } /* Send response */ - grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); + grh = (void *)recv->header.recv_buf.mad - + sizeof(struct ib_grh); if (agent_send(response, grh, wc, port_priv->device, port_priv->port_num)) { @@ -1175,7 +1197,8 @@ struct ib_mad_send_wr_private, agent_list); - if (time_after(mad_agent_priv->timeout, mad_send_wr->timeout)) { + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { mad_agent_priv->timeout = mad_send_wr->timeout; cancel_delayed_work(&mad_agent_priv->work); delay = mad_send_wr->timeout - jiffies; @@ -1204,7 +1227,8 @@ temp_mad_send_wr = list_entry(list_item, struct ib_mad_send_wr_private, agent_list); - if (time_after(mad_send_wr->timeout, temp_mad_send_wr->timeout)) + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) break; } list_add(&mad_send_wr->agent_list, list_item); @@ -1517,7 +1541,8 @@ PCI_DMA_FROMDEVICE); kmem_cache_free(ib_mad_cache, mad_priv); - printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx failed ret = %d\n", + printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx " + "failed ret = %d\n", (unsigned long long) recv_wr.wr_id, ret); return -EINVAL; } @@ -1607,7 +1632,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1628,7 +1654,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " + "ret = %d\n", ret); return ret; } @@ -1643,7 +1670,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1654,7 +1682,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " + "ret = %d\n", ret); return ret; } @@ -1669,7 +1698,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1681,7 +1711,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " + "ret = %d\n", ret); return ret; } @@ -1696,7 +1727,8 @@ attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - 
printk(KERN_ERR PFX "Couldn't allocate memory for ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't allocate memory for " + "ib_qp_attr\n"); return -ENOMEM; } @@ -1707,7 +1739,8 @@ kfree(attr); if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset ret = %d\n", ret); + printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " + "ret = %d\n", ret); return ret; } @@ -1743,14 +1776,16 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { - printk(KERN_ERR PFX "Failed to request completion notification\n"); + printk(KERN_ERR PFX "Failed to request completion " + "notification\n"); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive requests\n"); + printk(KERN_ERR PFX "Couldn't post receive " + "requests\n"); goto error; } } @@ -1777,11 +1812,13 @@ int i, ret; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp_info[i].qp); + ret = ib_mad_change_qp_state_to_reset( + port_priv->qp_info[i].qp); if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change " - "%s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, i); + printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" + " %s port %d QP%d state to RESET\n", + port_priv->device->name, port_priv->port_num, + i); } ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); @@ -1842,8 +1879,7 @@ * Create the QP, PD, MR, and CQ if needed */ static int ib_mad_port_open(struct ib_device *device, - int port_num, - int num_ports) + int port_num) { int ret, cq_size; u64 iova = 0; @@ -1872,7 +1908,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; - port_priv->phys_port_cnt = num_ports; spin_lock_init(&port_priv->reg_lock); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; @@ -1985,31 +2020,25 @@ static void ib_mad_init_device(struct ib_device *device) { int ret, num_ports, cur_port, i, ret2; - struct ib_device_attr device_attr; - ret = ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port, num_ports); + ret = ib_mad_port_open(device, cur_port); if (ret) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); goto error_device_open; } - ret = ib_agent_port_open(device, cur_port, num_ports); + ret = ib_agent_port_open(device, cur_port); if (ret) { - printk(KERN_ERR PFX "Couldn't open %s port %d for agents\n", + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", device->name, cur_port); goto error_device_open; } @@ -2022,7 +2051,8 @@ cur_port--; ret2 = ib_agent_port_close(device, cur_port); if (ret2) { - printk(KERN_ERR PFX "Couldn't close %s port %d for agent\n", + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", device->name, cur_port); } ret2 = ib_mad_port_close(device, cur_port); @@ -2039,26 +2069,20 @@ static void ib_mad_remove_device(struct ib_device *device) { - int ret, i, num_ports, cur_port, ret2; - struct ib_device_attr device_attr; + int ret = 0, i, num_ports, cur_port, ret2; - ret = 
ib_query_device(device, &device_attr); - if (ret) { - printk(KERN_ERR PFX "Couldn't query device %s\n", device->name); - goto error_device_query; - } - if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { - num_ports = device_attr.phys_port_cnt; + num_ports = device->phys_port_cnt; cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { ret2 = ib_agent_port_close(device, cur_port); if (ret2) { - printk(KERN_ERR PFX "Couldn't close %s port %d for agent\n", + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", device->name, cur_port); if (!ret) ret = ret2; @@ -2071,9 +2095,6 @@ ret = ret2; } } - -error_device_query: - return; } static struct ib_client mad_client = { Index: agent.c =================================================================== --- agent.c (revision 1168) +++ agent.c (working copy) @@ -84,7 +84,8 @@ return 1; port_priv = ib_get_agent_mad(device, port_num, NULL); if (!port_priv) { - printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d not open\n", + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", device->name, port_num); return 1; } @@ -109,7 +110,8 @@ /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { - printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent %p\n", + printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent " + "%p\n", mad_agent); goto out; } @@ -143,12 +145,16 @@ if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; - ah_attr.grh.sgid_index = 0; /* Should sgid be looked up -? */ + /* Should sgid be looked up ? */ + ah_attr.grh.sgid_index = 0; ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpup(&grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpup(&grh->version_tclass_flow) >> 20) & 0xff; - memcpy(ah_attr.grh.dgid.raw, grh->sgid.raw, sizeof(struct ib_grh)); + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(struct ib_grh)); } else { ah_attr.ah_flags = 0; /* No GRH for SM class */ } @@ -243,8 +249,8 @@ /* Find matching MAD agent */ port_priv = ib_get_agent_mad(NULL, 0, mad_agent); if (!port_priv) { - printk(KERN_ERR SPFX "agent_send_handler: no matching MAD agent " - "%p\n", mad_agent); + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); return; } @@ -252,8 +258,9 @@ spin_lock_irqsave(&port_priv->send_list_lock, flags); if (list_empty(&port_priv->send_posted_list)) { spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - printk(KERN_ERR SPFX "Send completion WR ID 0x%Lx but send list " - "is empty\n", (unsigned long long) mad_send_wc->wr_id); + printk(KERN_ERR SPFX "Send completion WR ID 0x%Lx but send " + "list is empty\n", + (unsigned long long) mad_send_wc->wr_id); return; } From roland at topspin.com Mon Nov 8 11:20:10 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:20:10 -0800 Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: (Krishna Kumar's message of "Mon, 8 Nov 2004 11:03:18 -0800 (PST)") References: Message-ID: <52wtwwb11x.fsf@topspin.com> Krishna> Hi Roland, I agree with this. BTW, can't the release Krishna> handler execute before the (I know, quirky race, but Krishna> interrupts ...) 
Yeah, good point (although the consumer can't rely on the value until the function has returned, the consumer's callback might overwrite it). - R. From roland at topspin.com Mon Nov 8 11:21:35 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:21:35 -0800 Subject: [openib-general] [PATCH] Fix panic and memory leak in SA Query. In-Reply-To: <52wtwwb11x.fsf@topspin.com> (Roland Dreier's message of "Mon, 08 Nov 2004 11:20:10 -0800") References: <52wtwwb11x.fsf@topspin.com> Message-ID: <52sm7kb0zk.fsf@topspin.com> I think this should be better: Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 1175) +++ core/sa_query.c (working copy) @@ -544,12 +544,13 @@ ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, query->sa_query.mad->data); + *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { + *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); - } else - *sa_query = &query->sa_query; + } return ret ? ret : query->sa_query.id; } @@ -619,12 +620,13 @@ ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), rec, query->sa_query.mad->data); + *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { + *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); - } else - *sa_query = &query->sa_query; + } return ret ? ret : query->sa_query.id; } From halr at voltaire.com Mon Nov 8 11:28:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 14:28:39 -0500 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <52654gcgrp.fsf@topspin.com> References: <52654gcgrp.fsf@topspin.com> Message-ID: <1099942119.25460.8.camel@hpc-1> On Mon, 2004-11-08 at 13:55, Roland Dreier wrote: > Convert mad.c and agent.c to use ib_get_dma_mr() instead of > ib_reg_phys_mr(). This is simpler and is actually required on > platforms such as sparc64 where DMA addresses may not match up with > physical RAM addresses. > > OK to commit? OK by me with same comment in two places below: > > - Roland > > Index: core/agent.c > =================================================================== > --- core/agent.c (revision 1172) > +++ core/agent.c (working copy) > @@ -281,15 +281,9 @@ > kfree(agent_send_wr); > } > > -int ib_agent_port_open(struct ib_device *device, int port_num, > - int phys_port_cnt) > +int ib_agent_port_open(struct ib_device *device, int port_num) > { > int ret; > - u64 iova = 0; > - struct ib_phys_buf buf_list = { > - .addr = 0, > - .size = (unsigned long) high_memory - PAGE_OFFSET > - }; > struct ib_agent_port_private *port_priv; > struct ib_mad_reg_req reg_req; > unsigned long flags; > @@ -312,7 +306,6 @@ > > memset(port_priv, 0, sizeof *port_priv); > port_priv->port_num = port_num; > - port_priv->phys_port_cnt = phys_port_cnt; > port_priv->wr_id = 0; > spin_lock_init(&port_priv->send_list_lock); > INIT_LIST_HEAD(&port_priv->send_posted_list); > @@ -356,9 +349,8 @@ > goto error4; > } > > - port_priv->mr = ib_reg_phys_mr(port_priv->dr_smp_agent->qp->pd, > - &buf_list, 1, > - IB_ACCESS_LOCAL_WRITE, &iova); > + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, > + IB_ACCESS_LOCAL_WRITE); > if (IS_ERR(port_priv->mr)) { > printk(KERN_ERR SPFX "Couldn't register MR\n"); Should this message be changed ? 
> ret = PTR_ERR(port_priv->mr); > Index: core/mad.c > =================================================================== > --- core/mad.c (revision 1172) > +++ core/mad.c (working copy) > @@ -1844,11 +1844,6 @@ > int port_num) > { > int ret, cq_size; > - u64 iova = 0; > - struct ib_phys_buf buf_list = { > - .addr = 0, > - .size = (unsigned long) high_memory - PAGE_OFFSET > - }; > struct ib_mad_port_private *port_priv; > unsigned long flags; > > @@ -1890,8 +1885,7 @@ > goto error4; > } > > - port_priv->mr = ib_reg_phys_mr(port_priv->pd, &buf_list, 1, > - IB_ACCESS_LOCAL_WRITE, &iova); > + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); > if (IS_ERR(port_priv->mr)) { > printk(KERN_ERR PFX "Couldn't register ib_mad MR\n"); Should this message be changed ? > ret = PTR_ERR(port_priv->mr); From krkumar at us.ibm.com Mon Nov 8 11:18:50 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Mon, 8 Nov 2004 11:18:50 -0800 (PST) Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: <521xf4cfzw.fsf@topspin.com> Message-ID: Good catch Sean. Yes, but since it is a race (hence uncommon), isn't it enough to let the ib_cancel_mad handle it ? It drops out if find_send_by_wr_id fails to find this entry. - KK On Mon, 8 Nov 2004, Roland Dreier wrote: > Actually looking at this code one more time: > > spin_lock_irqsave(&idr_lock, flags); > if (idr_find(&query_idr, query->id) != query) { > spin_unlock_irqrestore(&idr_lock, flags); > return; > } > spin_unlock_irqrestore(&idr_lock, flags); > > ib_cancel_mad(query->port->agent, query->id); > > I realize that it has a race. I check that the query is still around > inside the spinlock, but the query could complete and be freed in > between the unlock and the call to ib_cancel_mad(). I'll have to add > some reference counting... > > - R. From roland at topspin.com Mon Nov 8 11:45:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:45:41 -0800 Subject: [openib-general] [PATCH] Encapsulate finding of id in sa_query.c In-Reply-To: (Krishna Kumar's message of "Mon, 8 Nov 2004 11:18:50 -0800 (PST)") References: Message-ID: <52oei8azve.fsf@topspin.com> Krishna> Good catch Sean. Yes, but since it is a race (hence Krishna> uncommon), isn't it enough to let the ib_cancel_mad Krishna> handle it ? It drops out if find_send_by_wr_id fails to Krishna> find this entry. Actually it's my catch :) The problem is that ib_cancel_mad(query->port->agent, query->id); dereferences query, which might already be gone. I think I have a clean way to fix it though. - R. From roland at topspin.com Mon Nov 8 11:46:52 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 11:46:52 -0800 Subject: [openib-general] [PATCH] mad.c/agent.c: use ib_get_dma_mr In-Reply-To: <1099942119.25460.8.camel@hpc-1> (Hal Rosenstock's message of "Mon, 08 Nov 2004 14:28:39 -0500") References: <52654gcgrp.fsf@topspin.com> <1099942119.25460.8.camel@hpc-1> Message-ID: <52k6swaztf.fsf@topspin.com> OK, I committed with error messages like "Couldn't get ib_mad DMA MR" - R. 
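For reference, a minimal sketch of the registration pattern before and after this conversion, using only calls quoted in the patches above (ib_reg_phys_mr, ib_get_dma_mr, IS_ERR/PTR_ERR, ib_dereg_mr); the helper name setup_dma_mr and its error handling are illustrative, not from the actual tree:

	/*
	 * Hypothetical helper showing the new registration style: instead of
	 * describing all of physical RAM with an ib_phys_buf and calling
	 * ib_reg_phys_mr(), ask the HCA driver for an MR covering the whole
	 * DMA address space.  This is the part that keeps sparc64 working,
	 * where DMA addresses need not equal physical RAM addresses.
	 */
	static int setup_dma_mr(struct ib_pd *pd, struct ib_mr **mr)
	{
		*mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);
		if (IS_ERR(*mr))
			return PTR_ERR(*mr);
		return 0;
	}

Teardown is unchanged: the port close path still calls ib_dereg_mr() on the MR, as the quoted error-unwinding code in sa_query.c does.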
From halr at voltaire.com Mon Nov 8 12:29:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 15:29:04 -0500 Subject: [openib-general] [PATCH] Eliminate no longer used phys_port_cnt member in mad and agent structures Message-ID: <1099945743.8714.1.camel@hpc-1> Eliminate no longer used phys_port_cnt member in mad and agent structures Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1177) +++ mad_priv.h (working copy) @@ -115,7 +115,6 @@ atomic_t refcount; wait_queue_head_t wait; - int phys_port_cnt; u8 rmpp_version; }; @@ -157,7 +156,6 @@ struct list_head port_list; struct ib_device *device; int port_num; - int phys_port_cnt; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1177) +++ agent_priv.h (working copy) @@ -42,7 +42,6 @@ struct list_head send_posted_list; spinlock_t send_list_lock; int port_num; - int phys_port_cnt; struct ib_mad_agent *dr_smp_agent; /* DR SM class */ struct ib_mad_agent *lr_smp_agent; /* LR SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ From halr at voltaire.com Mon Nov 8 13:13:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 08 Nov 2004 16:13:57 -0500 Subject: [Fwd: Re: [openib-general] IPoIB Multicast] Message-ID: <1099948436.8714.23.camel@hpc-1> On the first issue, I now understand why there are 2 joins for the broadcast group. ipoib_set_mcast_list causes the ipoib_restart_task to be run, which stops and starts the multicast thread. Doing this causes the first join to the broadcast group to be cancelled (flushed), and even though it appears to work on the IB wire, it is not completed in the host. The second join for the broadcast group is completed without being cancelled. Not sure what (if anything) should be done about this. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [openib-general] IPoIB Multicast Date: 06 Nov 2004 14:48:38 -0500 On Sat, 2004-11-06 at 14:27, Roland Dreier wrote: > Hal> 1. If you down the interface and bring it back up, the second > Hal> time up, there are 2 identical join requests for the > Hal> broadcast group rather than just 1. These 2 come out very > Hal> close to one another (217 usec apart). Is there some counting > Hal> issue that is causing this ? > > Hal> 2. When leaving an IP multicast group, there appears to be an > Hal> extra join to 0x16 (something like 224.0.0.22 which would be > Hal> for IGMP). Any ideas on this ? > > If you or someone else doesn't debug these issues first, I'll take a > look at the code. I'll take a first crack and look at the code to see what I can determine. On the second issue, I partially understand what is going on: IPmc group changes need to be reported via IGMP so the IPmc router knows to prune the multicast tree, but... first, I don't understand why it joins here (and not earlier, when an IPmc group is first joined by this node); and second, after the join is successful, I do not see any IGMP packet come out of the node (onto IB; maybe it is going out the ethernet instead). 
-- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Mon Nov 8 15:48:41 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 15:48:41 -0800 Subject: [openib-general] ib_mad_recv_done_handler questions Message-ID: <419005D9.8070305@ichips.intel.com> Looking at the latest changes to ib_mad_recv_done_handler, I have a couple of questions: * If the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Can we either push this allocation down into the HCA driver, or find an alternative way of interacting between the two drivers that doesn't require this allocation unless a response will be generated? * If process_mad consumes the MAD, should the code just goto out? Something more like: ret = port_priv->device->process_mad(...) if ((ret & IB_MAD_RESULT_SUCCESS) && (ret & IB_MAD_RESULT_REPLY)) { ... } else becomes ret = port_priv->device->process_mad(...) if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { ... } ... goto out; } else Does the MAD still need to be dispatched in this case? - Sean From mshefty at ichips.intel.com Mon Nov 8 16:27:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 16:27:23 -0800 Subject: [openib-general] MAD agent code comments Message-ID: <41900EEB.7050109@ichips.intel.com> A couple of comments (so far) while tracing through the MAD agent code. * There are a couple of places where ib_get_agent_mad() will be called multiple times in the same execution path. For example agent_send calls it, as does agent_mad_send. I didn't check to see if the calls would return the same ib_agent_port_private structure. (Would calling the function ib_get_agent_port() make more sense?) * The agent code assumes that sends are completed in the order that they are posted. The MAD code does not guarantee that this is the case. (It cannot do this as a result of matching requests with responses, handling timeouts, and error handling.) - Sean From roland at topspin.com Mon Nov 8 16:51:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 16:51:05 -0800 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <419005D9.8070305@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Nov 2004 15:48:41 -0800") References: <419005D9.8070305@ichips.intel.com> Message-ID: <521xf3c0au.fsf@topspin.com> Sean> * If the underlying driver provides a process_mad routine, a Sean> response MAD is allocated every time a MAD is received on QP Sean> 0 or 1. Can we either push this allocation down into the Sean> HCA driver, or find an alternative way of interacting Sean> between the two drivers that doesn't require this allocation Sean> unless a response will be generated? How about if the MAD layer allocates a response MAD when a MAD is received, and if the process_mad call doesn't actually generate a response the MAD layer just stashes the response MAD away to use for the next receive? This should keep the number of allocations within 1 of the number of responses actually generated, but save us from tracking allocations between two layers. - R. 
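A rough sketch of the stashing scheme just described, assuming a hypothetical spare_response member added to the port private structure (the field and helper names below are invented for illustration, and locking against concurrent receives is omitted); only ib_mad_cache and the kmem_cache calls come from the quoted code:

	/*
	 * Keep at most one preallocated response MAD per port.  A receive
	 * takes the spare if one is stashed, otherwise allocates a fresh
	 * buffer; if process_mad() generates no reply, the buffer goes
	 * back into the stash for the next receive.  Allocations therefore
	 * stay within one of the number of replies actually generated.
	 */
	static struct ib_mad_private *
	get_response_mad(struct ib_mad_port_private *port_priv)
	{
		struct ib_mad_private *response = port_priv->spare_response;

		if (response)
			port_priv->spare_response = NULL;
		else
			response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
		return response;
	}

	static void
	stash_response_mad(struct ib_mad_port_private *port_priv,
			   struct ib_mad_private *response)
	{
		/* process_mad() did not reply; reuse the buffer next time */
		port_priv->spare_response = response;
	}

The design point is that the two layers never hand ownership of an allocation back and forth: the MAD layer both allocates and stashes, so the HCA driver's process_mad() only ever fills in a buffer it is given.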
From mshefty at ichips.intel.com Mon Nov 8 16:57:28 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Nov 2004 16:57:28 -0800 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <521xf3c0au.fsf@topspin.com> References: <419005D9.8070305@ichips.intel.com> <521xf3c0au.fsf@topspin.com> Message-ID: <419015F8.4030500@ichips.intel.com> Roland Dreier wrote: > Sean> * If the underlying driver provides a process_mad routine, a > Sean> response MAD is allocated every time a MAD is received on QP > Sean> 0 or 1. Can we either push this allocation down into the > Sean> HCA driver, or find an alternative way of interacting > Sean> between the two drivers that doesn't require this allocation > Sean> unless a response will be generated? > > How about if the MAD layer allocates a response MAD when a MAD is > received, and if the process_mad call doesn't actually generate a > response the MAD layer just stashed the response MAD away to use for > the next receive? This should keep the number of allocations within 1 > of the number of responses actually generated, but save us from > tracking allocations between two layers. That sounds reasonable, and I think avoiding allocations in the HCA driver is desirable given the current design. - Sean From roland at topspin.com Mon Nov 8 20:52:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 08 Nov 2004 20:52:04 -0800 Subject: [openib-general] [PATCH] Convert cache.c to use RCU Message-ID: <52lldbaakr.fsf@topspin.com> Use RCU instead of seqlocks, and simplify the code. Index: core/device.c =================================================================== --- core/device.c (revision 1178) +++ core/device.c (working copy) @@ -190,8 +190,7 @@ int ib_register_device(struct ib_device *device) { - struct ib_device_private *priv; - int ret; + int ret; down(&device_sem); @@ -206,51 +205,16 @@ goto out; } - priv = kmalloc(sizeof *priv, GFP_KERNEL); - if (!priv) { - printk(KERN_WARNING "Couldn't allocate private struct for %s\n", - device->name); - ret = -ENOMEM; - goto out; - } - - *priv = (struct ib_device_private) { 0 }; - - if (device->node_type == IB_NODE_SWITCH) { - priv->start_port = priv->end_port = 0; - } else { - priv->start_port = 1; - priv->end_port = device->phys_port_cnt; - } - - priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), - GFP_KERNEL); - if (!priv->port_data) { - printk(KERN_WARNING "Couldn't allocate port info for %s\n", - device->name); - ret = -ENOMEM; - goto out_free; - } - - device->core = priv; - INIT_LIST_HEAD(&device->event_handler_list); INIT_LIST_HEAD(&device->client_data_list); spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); - ret = ib_cache_setup(device); - if (ret) { - printk(KERN_WARNING "Couldn't create device info cache for %s\n", - device->name); - goto out_free_port; - } - ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); - goto out_free_cache; + goto out; } list_add_tail(&device->core_list, &device_list); @@ -265,18 +229,6 @@ client->add(device); } - up(&device_sem); - return 0; - - out_free_cache: - ib_cache_cleanup(device); - - out_free_port: - kfree(priv->port_data); - - out_free: - kfree(priv); - out: up(&device_sem); return ret; @@ -285,7 +237,6 @@ void ib_unregister_device(struct ib_device *device) { - struct ib_device_private *priv = device->core; struct ib_client *client; struct ib_client_data *context, *tmp; unsigned long 
flags; @@ -305,11 +256,6 @@ kfree(context); spin_unlock_irqrestore(&device->client_data_lock, flags); - ib_cache_cleanup(device); - - kfree(priv->port_data); - kfree(priv); - device->reg_state = IB_DEV_UNREGISTERED; } EXPORT_SYMBOL(ib_unregister_device); @@ -490,11 +436,18 @@ if (ret) printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + return ret; } static void __exit ib_core_cleanup(void) { + ib_cache_cleanup(); ib_sysfs_cleanup(); } Index: core/cache.c =================================================================== --- core/cache.c (revision 1178) +++ core/cache.c (working copy) @@ -23,57 +23,77 @@ #include #include - #include #include +#include #include "core_priv.h" -int ib_cached_lid_get(struct ib_device *device, - u8 port, - struct ib_port_lid *port_lid) -{ - struct ib_device_private *priv; - unsigned int seq; +struct ib_pkey_cache { + struct rcu_head rcu; + int table_len; + u16 table[0]; +}; - priv = device->core; +struct ib_gid_cache { + struct rcu_head rcu; + int table_len; + union ib_gid table[0]; +}; - if (port < priv->start_port || port > priv->end_port) - return -EINVAL; +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - memcpy(port_lid, - &priv->port_data[port].port_lid, - sizeof (struct ib_port_lid)); - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} - return 0; +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; } -EXPORT_SYMBOL(ib_cached_lid_get); +static void rcu_free_pkey(struct rcu_head *head) +{ + struct ib_pkey_cache *cache = + container_of(head, struct ib_pkey_cache, rcu); + kfree(cache); +} + +static void rcu_free_gid(struct rcu_head *head) +{ + struct ib_gid_cache *cache = + container_of(head, struct ib_gid_cache, rcu); + kfree(cache); +} + int ib_cached_gid_get(struct ib_device *device, u8 port, int index, union ib_gid *gid) { - struct ib_device_private *priv; - unsigned int seq; + struct ib_gid_cache *cache; + int ret = 0; - priv = device->core; - - if (port < priv->start_port || port > priv->end_port) + if (port < start_port(device) || port > end_port(device)) return -EINVAL; - if (index < 0 || index >= priv->port_data[port].properties.gid_tbl_len) - return -EINVAL; + rcu_read_lock(); - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - *gid = priv->port_data[port].gid_table[index]; - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); + cache = rcu_dereference(device->cache.gid_cache[port - start_port(device)]); - return 0; + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + rcu_read_unlock(); + + return ret; } EXPORT_SYMBOL(ib_cached_gid_get); @@ -82,23 +102,24 @@ int index, u16 *pkey) { - struct ib_device_private *priv; - unsigned int seq; + struct ib_pkey_cache *cache; + int ret = 0; - priv = device->core; - - if (port < priv->start_port || port > priv->end_port) + if (port < start_port(device) || port > end_port(device)) return -EINVAL; - if (index < 0 || index >= priv->port_data[port].properties.pkey_tbl_len) - return -EINVAL; + rcu_read_lock(); - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - *pkey = priv->port_data[port].pkey_table[index]; - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); - return 0; + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + rcu_read_unlock(); + + return ret; } EXPORT_SYMBOL(ib_cached_pkey_get); @@ -107,207 +128,214 @@ u16 pkey, u16 *index) { - struct ib_device_private *priv; - unsigned int seq; - int i; - int found; + struct ib_pkey_cache *cache; + int i; + int ret = -ENOENT; - priv = device->core; - - if (port < priv->start_port || port > priv->end_port) + if (port < start_port(device) || port > end_port(device)) return -EINVAL; - do { - seq = read_seqcount_begin(&priv->port_data[port].lock); - found = -1; - for (i = 0; i < priv->port_data[port].properties.pkey_tbl_len; ++i) { - if ((priv->port_data[port].pkey_table[i] & 0x7fff) == - (pkey & 0x7fff)) { - found = i; - break; - } + rcu_read_lock(); + + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; } - } while (read_seqcount_retry(&priv->port_data[port].lock, seq)); - if (found < 0) { - return -ENOENT; - } else { - *index = found; - return 0; - } + rcu_read_unlock(); + return ret; } EXPORT_SYMBOL(ib_cached_pkey_find); static void ib_cache_update(struct ib_device *device, u8 port) { - struct ib_device_private *priv = device->core; - struct ib_port_data *info = &priv->port_data[port]; struct ib_port_attr *tprops = NULL; - union ib_gid *tgid = NULL; - u16 *tpkey = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache 
*gid_cache = NULL, *old_gid_cache; int i; int ret; tprops = kmalloc(sizeof *tprops, GFP_KERNEL); if (!tprops) - goto out; + return; - ret = device->query_port(device, port, tprops); + ret = ib_query_port(device, port, tprops); if (ret) { - printk(KERN_WARNING "query_port failed (%d) for %s\n", + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", ret, device->name); - goto out; + goto err; } - tprops->gid_tbl_len = min(tprops->gid_tbl_len, - info->gid_table_alloc_length); - tgid = kmalloc(tprops->gid_tbl_len * sizeof *tgid, GFP_KERNEL); - if (!tgid) - goto out; + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; - for (i = 0; i < tprops->gid_tbl_len; ++i) { - ret = device->query_gid(device, port, i, tgid + i); + INIT_RCU_HEAD(&pkey_cache->rcu); + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + INIT_RCU_HEAD(&gid_cache->rcu); + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); if (ret) { - printk(KERN_WARNING "query_gid failed (%d) for %s (index %d)\n", + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", ret, device->name, i); - goto out; + goto err; } } - tprops->pkey_tbl_len = min(tprops->pkey_tbl_len, - info->pkey_table_alloc_length); - tpkey = kmalloc(tprops->pkey_tbl_len * sizeof (u16), - GFP_KERNEL); - if (!tpkey) - goto out; - - for (i = 0; i < tprops->pkey_tbl_len; ++i) { - ret = device->query_pkey(device, port, i, &tpkey[i]); + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); if (ret) { - printk(KERN_WARNING "query_pkey failed (%d) " - "for %s, port %d, index %d\n", - ret, device->name, port, i); - goto out; + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; } } - write_seqcount_begin(&info->lock); + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; - info->properties = *tprops; +#warning Delete definition of rcu_assign_pointer when 2.6.10 is released! 
+#ifndef rcu_assign_pointer +#define rcu_assign_pointer(p, v) ({ \ + smp_wmb(); \ + (p) = (v); \ + }) +#endif - info->port_lid.lid = info->properties.lid; - info->port_lid.lmc = info->properties.lmc; + rcu_assign_pointer(device->cache.pkey_cache[port - start_port(device)], + pkey_cache); + rcu_assign_pointer(device->cache.gid_cache [port - start_port(device)], + gid_cache); - memcpy(info->gid_table, tgid, - tprops->gid_tbl_len * sizeof *tgid); - memcpy(info->pkey_table, tpkey, - tprops->pkey_tbl_len * sizeof *tpkey); + if (old_pkey_cache) + call_rcu(&old_pkey_cache->rcu, rcu_free_pkey); + if (old_gid_cache) + call_rcu(&old_gid_cache->rcu, rcu_free_gid); - write_seqcount_end(&info->lock); + kfree(tprops); + return; - out: +err: + kfree(pkey_cache); + kfree(gid_cache); kfree(tprops); - kfree(tpkey); - kfree(tgid); } -static void ib_cache_task(void *port_ptr) +static void ib_cache_task(void *work_ptr) { - struct ib_port_data *port_data = port_ptr; + struct ib_update_work *work = work_ptr; - ib_cache_update(port_data->device, port_data->port_num); + ib_cache_update(work->device, work->port_num); + kfree(work); } static void ib_cache_event(struct ib_event_handler *handler, struct ib_event *event) { + struct ib_update_work *work; + if (event->event == IB_EVENT_PORT_ERR || event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE) { - struct ib_device_private *priv = event->device->core; - schedule_work(&priv->port_data[event->element.port_num].refresh_task); + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } } } -int ib_cache_setup(struct ib_device *device) +void ib_cache_setup_one(struct ib_device *device) { - struct ib_device_private *priv = device->core; - struct ib_port_attr prop; - int p; - int ret; + int p; - for (p = priv->start_port; p <= priv->end_port; ++p) { - priv->port_data[p].device = device; - priv->port_data[p].port_num = p; - INIT_WORK(&priv->port_data[p].refresh_task, - ib_cache_task, &priv->port_data[p]); - priv->port_data[p].gid_table = NULL; - priv->port_data[p].pkey_table = NULL; - priv->port_data[p].event_handler.device = NULL; + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.gid_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; } - for (p = priv->start_port; p <= priv->end_port; ++p) { - seqcount_init(&priv->port_data[p].lock); - ret = device->query_port(device, p, &prop); - if (ret) { - printk(KERN_WARNING "query_port failed for %s\n", - device->name); - goto error; - } - priv->port_data[p].gid_table_alloc_length = prop.gid_tbl_len; - priv->port_data[p].gid_table = kmalloc(prop.gid_tbl_len * - sizeof (union ib_gid), - GFP_KERNEL); - if (!priv->port_data[p].gid_table) { - ret = -ENOMEM; - goto error; - } + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } - priv->port_data[p].pkey_table_alloc_length = prop.pkey_tbl_len; - priv->port_data[p].pkey_table = 
kmalloc(prop.pkey_tbl_len * sizeof (u16), - GFP_KERNEL); - if (!priv->port_data[p].pkey_table) { - ret = -ENOMEM; - goto error; - } + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; - ib_cache_update(device, p); + return; - INIT_IB_EVENT_HANDLER(&priv->port_data[p].event_handler, - device, ib_cache_event); - ret = ib_register_event_handler(&priv->port_data[p].event_handler); - if (ret) { - priv->port_data[p].event_handler.device = NULL; - goto error; - } +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); } - return 0; +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} - error: - for (p = priv->start_port; p <= priv->end_port; ++p) { - if (priv->port_data[p].event_handler.device) - ib_unregister_event_handler(&priv->port_data[p].event_handler); - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); } - return ret; + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); } -void ib_cache_cleanup(struct ib_device *device) +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) { - struct ib_device_private *priv = device->core; - int p; + return ib_register_client(&cache_client); +} - for (p = priv->start_port; p <= priv->end_port; ++p) { - ib_unregister_event_handler(&priv->port_data[p].event_handler); - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); - } +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); } /* Index: core/core_priv.h =================================================================== --- core/core_priv.h (revision 1178) +++ core/core_priv.h (working copy) @@ -29,39 +29,15 @@ #include -struct ib_device_private { - int start_port; - int end_port; - u64 node_guid; - struct ib_port_data *port_data; -}; - -struct ib_port_data { - struct ib_device *device; - - struct ib_event_handler event_handler; - struct work_struct refresh_task; - - seqcount_t lock; - struct ib_port_attr properties; - struct ib_port_lid port_lid; - int gid_table_alloc_length; - u16 pkey_table_alloc_length; - union ib_gid *gid_table; - u16 *pkey_table; - u8 port_num; -}; - -int ib_cache_setup(struct ib_device *device); -void ib_cache_cleanup(struct ib_device *device); -void ib_completion_thread(struct list_head *entry, void *device_ptr); -void ib_async_thread(struct list_head *entry, void *device_ptr); - int ib_device_register_sysfs(struct ib_device *device); void ib_device_unregister_sysfs(struct ib_device *device); + int ib_sysfs_setup(void); void ib_sysfs_cleanup(void); +int ib_cache_setup(void); +void ib_cache_cleanup(void); + #endif /* _CORE_PRIV_H */ /* Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 1178) +++ include/ib_verbs.h (working copy) @@ -672,6 +672,12 @@ #define IB_DEVICE_NAME_MAX 64 +struct ib_cache { + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache 
**gid_cache; +}; + struct ib_device { struct pci_dev *dma_device; @@ -684,7 +690,8 @@ struct list_head client_data_list; spinlock_t client_data_lock; - void *core; + struct ib_cache cache; + u32 flags; int (*query_device)(struct ib_device *device, Index: include/ts_ib_core.h =================================================================== --- include/ts_ib_core.h (revision 1178) +++ include/ts_ib_core.h (working copy) @@ -24,14 +24,6 @@ #ifndef _TS_IB_CORE_H #define _TS_IB_CORE_H -struct ib_port_lid { - u16 lid; - u8 lmc; -}; - -int ib_cached_lid_get(struct ib_device *device, - u8 port, - struct ib_port_lid *port_lid); int ib_cached_gid_get(struct ib_device *device, u8 port, int index, Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 1178) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -517,10 +517,12 @@ } { - struct ib_port_lid port_lid; + struct ib_port_attr attr; - ib_cached_lid_get(priv->ca, priv->port, &port_lid); - priv->local_lid = port_lid.lid; + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); } priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - From sreenivasulu at topspin.com Tue Nov 9 02:26:04 2004 From: sreenivasulu at topspin.com (Sreenivasulu Pulichintala) Date: Tue, 9 Nov 2004 15:56:04 +0530 Subject: [openib-general] VAPI_RETRY_EXC_ERR Message-ID: <4A388685F814D54CAE412B2DAB7CE91C195454@initexch.topspincom.com> Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, the application sometimes crashes, producing the following error - === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_spinandwaitcq: Assertion `sc->status == VAPI_SUCCESS' failed. Timeout alarm signaled Cleaning up all processes ...done. Killed by signal 15. === In what cases might I get this error? Is it because of RESYNC? Any help in this regard is highly appreciated. Thanks Sree -------------- next part -------------- An HTML attachment was scrubbed... URL: From sreenivasulu at topspin.com Tue Nov 9 02:49:17 2004 From: sreenivasulu at topspin.com (Sreenivasulu Pulichintala) Date: Tue, 9 Nov 2004 16:19:17 +0530 Subject: [openib-general] VAPI_RETRY_EXC_ERR Message-ID: <4A388685F814D54CAE412B2DAB7CE91C195455@initexch.topspincom.com> The corresponding IB macro is IB_COMP_RETRY_EXC_ERR. -----Original Message----- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: openib-general at openib.org Subject: [openib-general] VAPI_RETRY_EXC_ERR Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, the application sometimes crashes, producing the following error - === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_spinandwaitcq: Assertion `sc->status == VAPI_SUCCESS' failed. Timeout alarm signaled Cleaning up all processes ...done. Killed by signal 15. === In what cases might I get this error? Is it because of RESYNC? Any help in this regard is highly appreciated. Thanks Sree -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Tue Nov 9 05:57:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 08:57:56 -0500 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <419005D9.8070305@ichips.intel.com> References: <419005D9.8070305@ichips.intel.com> Message-ID: <1100008675.8714.2084.camel@hpc-1> On Mon, 2004-11-08 at 18:48, Sean Hefty wrote: > Looking at the latest changes to ib_mad_recv_done_handler, I have a > couple of questions: > * If process_mad consumes the MAD, should the code just goto out? > Something more like: > > ret = port_priv->device->process_mad(...) > if ((ret & IB_MAD_RESULT_SUCCESS) && > (ret & IB_MAD_RESULT_REPLY)) { > ... > } else > > becomes > > ret = port_priv->device->process_mad(...) > if (ret & IB_MAD_RESULT_SUCCESS)) { > if (ret & IB_MAD_RESULT_REPLY)) { > ... > } > ... > goto out; > } else Patch shortly on this. > Does the MAD still need to be dispatched in this case? I'm not sure exactly what all the reasons for !success being returned from process_mad are but my reasoning was as follows: In this error case, it is unclear whether the packet would have been consumed or not. If it would not have been consumed, it should be dispatched. It is only in the case where it would have been consumed that dispatching it causes a potential issue. If the packet is indeed dispatched to a client, wouldn't/shouldn't the client throw it away (as unexpected) ? If it is thrown away in this error case (a more conservative strategy), some retransmission strategy would kick in on one side or the other. I wasn't sure about this and chose the former strategy. -- Hal From halr at voltaire.com Tue Nov 9 06:01:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:01:37 -0500 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases Message-ID: <1100008896.8714.2105.camel@hpc-1> mad: In ib_mad_recv_done_handler, don't dispatch additional error cases Index: mad.c =================================================================== --- mad.c (revision 1180) +++ mad.c (working copy) @@ -1138,26 +1138,27 @@ wc->slid, recv->header.recv_buf.mad, response); - if ((ret & IB_MAD_RESULT_SUCCESS) && - (ret & IB_MAD_RESULT_REPLY)) { - if (response->mad_hdr.mgmt_class == - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) { + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_REPLY) { + if (response->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)response, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) { + kfree(response); + goto out; + } + } + /* Send response */ + grh = (void *)recv->header.recv_buf.mad - + sizeof(struct ib_grh); + if (agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) { kfree(response); - goto out; } - } - /* Send response */ - grh = (void *)recv->header.recv_buf.mad - - sizeof(struct ib_grh); - if (agent_send(response, grh, wc, - port_priv->device, - port_priv->port_num)) { - kfree(response); goto out; } } else From halr at voltaire.com Tue Nov 9 06:12:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:12:38 -0500 Subject: [openib-general] MAD agent code comments In-Reply-To: <41900EEB.7050109@ichips.intel.com> References: <41900EEB.7050109@ichips.intel.com> 
Message-ID: <1100009558.13933.3.camel@localhost.localdomain> On Mon, 2004-11-08 at 19:27, Sean Hefty wrote: > A couple of comments (so far) while tracing through the MAD agent code. > > * There are a couple of places where ib_get_agent_mad() will be called > multiple times in the same execution path. For example agent_send calls > it, as does agent_mad_send. Are there others like this ? > I didn't check to see if the calls would > return the same ib_agent_port_private structure. I eliminated the duplicate call. Patch shortly on this. > (Would calling the > function ib_get_agent_port() make more sense?) Yes. > * The agent code assumes that sends are completed in the order that they > are posted. The MAD code does not guarantee that this is the case. (It > cannot do this as a result of matching requests with response, handling > timeouts, and error handling.) Since the agent does not use solicited sends, are its sends completed in order (so this is only an issue for clients using solicited sends) ? -- Hal From halr at voltaire.com Tue Nov 9 06:27:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:27:59 -0500 Subject: [openib-general] [PATCH] agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send Message-ID: <1100010478.26166.2.camel@hpc-1> agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send (pointed out by Sean Hefty) Index: agent.c =================================================================== --- agent.c (revision 1180) +++ agent.c (working copy) @@ -35,8 +35,8 @@ static inline struct ib_agent_port_private * -__ib_get_agent_mad(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) { struct ib_agent_port_private *entry; @@ -61,14 +61,14 @@ } static inline struct ib_agent_port_private * -ib_get_agent_mad(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) { struct ib_agent_port_private *entry; unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - entry = __ib_get_agent_mad(device, port_num, mad_agent); + entry = __ib_get_agent_port(device, port_num, mad_agent); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); return entry; @@ -82,7 +82,7 @@ if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) return 1; - port_priv = ib_get_agent_mad(device, port_num, NULL); + port_priv = ib_get_agent_port(device, port_num, NULL); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " "not open\n", @@ -94,11 +94,11 @@ } static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, struct ib_mad *mad, struct ib_grh *grh, struct ib_wc *wc) { - struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; struct ib_sge gather_list; struct ib_send_wr send_wr; @@ -107,15 +107,6 @@ unsigned long flags; int ret = 1; - /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_mad_send: no matching MAD agent " - "%p\n", - mad_agent); - goto out; - } - agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); if (!agent_send_wr) goto out; @@ -213,7 +204,7 @@ struct ib_agent_port_private *port_priv; struct ib_mad_agent *mad_agent; - port_priv = ib_get_agent_mad(device, 
port_num, NULL); + port_priv = ib_get_agent_port(device, port_num, NULL); if (!port_priv) { printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", device->name, port_num); @@ -235,7 +226,7 @@ return 1; } - return agent_mad_send(mad_agent, mad, grh, wc); + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); } static void agent_send_handler(struct ib_mad_agent *mad_agent, @@ -247,7 +238,7 @@ unsigned long flags; /* Find matching MAD agent */ - port_priv = ib_get_agent_mad(NULL, 0, mad_agent); + port_priv = ib_get_agent_port(NULL, 0, mad_agent); if (!port_priv) { printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " "agent %p\n", mad_agent); @@ -296,7 +287,7 @@ unsigned long flags; /* First, check if port already open for SMI */ - port_priv = ib_get_agent_mad(device, port_num, NULL); + port_priv = ib_get_agent_port(device, port_num, NULL); if (port_priv) { printk(KERN_DEBUG SPFX "%s port %d already open\n", device->name, port_num); @@ -388,7 +379,7 @@ unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - port_priv = __ib_get_agent_mad(device, port_num, NULL); + port_priv = __ib_get_agent_port(device, port_num, NULL); if (port_priv == NULL) { spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); printk(KERN_ERR SPFX "Port %d not found\n", port_num); From halr at voltaire.com Tue Nov 9 06:21:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:21:25 -0500 Subject: [openib-general] ib_mad_recv_done_handler questions In-Reply-To: <521xf3c0au.fsf@topspin.com> References: <419005D9.8070305@ichips.intel.com> <521xf3c0au.fsf@topspin.com> Message-ID: <1100010085.13933.5.camel@localhost.localdomain> On Mon, 2004-11-08 at 19:51, Roland Dreier wrote: > Sean> * If the underlying driver provides a process_mad routine, a > Sean> response MAD is allocated every time a MAD is received on QP > Sean> 0 or 1. Can we either push this allocation down into the > Sean> HCA driver, or find an alternative way of interacting > Sean> between the two drivers that doesn't require this allocation > Sean> unless a response will be generated? > > How about if the MAD layer allocates a response MAD when a MAD is > received, and if the process_mad call doesn't actually generate a > response the MAD layer just stashed the response MAD away to use for > the next receive? This should keep the number of allocations within 1 > of the number of responses actually generated, but save us from > tracking allocations between two layers. I like it. I'll work up a patch for this. -- Hal From halr at voltaire.com Tue Nov 9 06:34:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 09:34:28 -0500 Subject: [openib-general] IPoIB Completion Handling Message-ID: <1100010867.26166.7.camel@hpc-1> Hi Roland, In ipoib_ib_handle_wc when status != success, isn't the WC opcode invalid ? Also, in that case, don't receives also need to be reposted ? 
-- Hal From tziporet at mellanox.co.il Tue Nov 9 06:43:01 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 9 Nov 2004 16:43:01 +0200 Subject: [openib-general] VAPI_RETRY_EXC_ERR Message-ID: <506C3D7B14CDD411A52C00025558DED6064BE9E6@mtlex01.yok.mtl.com> There can be several problems: - The retry count is too small; try the maximum value, 7. - Maybe the timeout is too small, so the HCA starts retrying too soon; try enlarging it to 21. - The PSNs of the two sides are not synchronized. - The link failed. - The QP on the other side was closed or moved to the error state. If this error occurs at the beginning of the application, it can indicate that the QP configuration is wrong. Tziporet -----Original Message----- From: Sreenivasulu Pulichintala [mailto:sreenivasulu at topspin.com] Sent: Tuesday, November 09, 2004 12:49 PM To: openib-general at openib.org Subject: RE: [openib-general] VAPI_RETRY_EXC_ERR The corresponding IB macro is IB_COMP_RETRY_EXC_ERR. -----Original Message----- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: openib-general at openib.org Subject: [openib-general] VAPI_RETRY_EXC_ERR Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, the application sometimes crashes, producing the following error - === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_spinandwaitcq: Assertion `sc->status == VAPI_SUCCESS' failed. Timeout alarm signaled Cleaning up all processes ...done. Killed by signal 15. === In what cases might I get this error? Is it because of RESYNC? Any help in this regard is highly appreciated. Thanks Sree -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Tue Nov 9 07:09:53 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 07:09:53 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <1100010867.26166.7.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 09:34:28 -0500") References: <1100010867.26166.7.camel@hpc-1> Message-ID: <52hdnz9hz2.fsf@topspin.com> Hal> In ipoib_ib_handle_wc when status != success, isn't the WC Hal> opcode invalid ? Also, in that case, don't receives also need Hal> to be reposted ? Yes, the error handling in IPoIB needs to be fixed. - R. From roland at topspin.com Tue Nov 9 07:37:50 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 07:37:50 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <52hdnz9hz2.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 07:09:53 -0800") References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> Message-ID: <524qjz9goh.fsf@topspin.com> Hal> In ipoib_ib_handle_wc when status != success, isn't the WC Hal> opcode invalid ? Also, in that case, don't receives also need Hal> to be reposted ? Roland> Yes, the error handling in IPoIB needs to be fixed. By the way, reposting the receives is not the right thing to do on error -- the QP will be in the error state, so any new work requests will just complete with a flush status. We need to reset the QP and start over to recover from errors. - R.
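To make the recovery sequence Roland describes concrete: a QP in the error state must be cycled back through reset before it will accept useful work again, and receives can only be reposted once it has left reset. Below is a minimal sketch for a UD QP, assuming the gen2 ib_modify_qp() interface and the standard UD attribute masks; it is illustrative only, not the eventual IPoIB/MAD fix.

static int ud_qp_restart(struct ib_qp *qp, u8 port, u16 pkey_index, u32 qkey)
{
	struct ib_qp_attr attr;
	int ret;

	memset(&attr, 0, sizeof attr);

	/* Error -> Reset; any outstanding WRs have already flushed. */
	attr.qp_state = IB_QPS_RESET;
	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret)
		return ret;

	/* Reset -> Init; receives may be reposted from this state on. */
	attr.qp_state   = IB_QPS_INIT;
	attr.pkey_index = pkey_index;
	attr.port_num   = port;
	attr.qkey       = qkey;
	ret = ib_modify_qp(qp, &attr,
			   IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_QKEY);
	if (ret)
		return ret;

	/* Init -> RTR -> RTS; for UD, only the state and the send PSN matter. */
	attr.qp_state = IB_QPS_RTR;
	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret)
		return ret;

	attr.qp_state = IB_QPS_RTS;
	attr.sq_psn   = 0;
	return ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_SQ_PSN);
}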
From halr at voltaire.com Tue Nov 9 08:05:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 11:05:46 -0500 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <524qjz9goh.fsf@topspin.com> References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> <524qjz9goh.fsf@topspin.com> Message-ID: <1100016345.13933.230.camel@localhost.localdomain> On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: > By the way, reposting the receives is not the right thing to do on > error -- the QP will be in the error state, so any new work requests > will just complete with a flush status. We need to reset the QP and > start over to recover from errors. Is the same thing true for QP0/1 ? If so, this needs to be done there as well. (There used to be a port restart there but this was excised). -- Hal From roland at topspin.com Tue Nov 9 08:37:05 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 08:37:05 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <1100016345.13933.230.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 11:05:46 -0500") References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> <524qjz9goh.fsf@topspin.com> <1100016345.13933.230.camel@localhost.localdomain> Message-ID: <52zn1r7zda.fsf@topspin.com> Roland> By the way, reposting the receives is not the right thing Roland> to do on error -- the QP will be in the error state, so Roland> any new work requests will just complete with a flush Roland> status. We need to reset the QP and start over to recover Roland> from errors. Hal> Is the same thing true for QP0/1 ? If so, this needs to be Hal> done there as well. (There used to be a port restart there Hal> but this was excised). Yes, of course (QP0/1 act just like normal UD QPs as far as work request processing/error semantics go). - R. From halr at voltaire.com Tue Nov 9 08:49:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 11:49:21 -0500 Subject: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy Message-ID: <1100018960.7222.6.camel@hpc-1> mad/agent: Modify receive buffer allocation strategy (Inefficiency pointed out by Sean; algorithm described by Roland) Problem: Currently, if the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Solution: The MAD layer can allocate a response MAD when a MAD is received, and if the process_mad call doesn't actually generate a response the MAD layer just stashes the response MAD away to use for the next receive. This should keep the number of allocations within 1 of the number of responses actually generated, but save us from tracking allocations between two layers. 
Index: agent.h =================================================================== --- agent.h (revision 1180) +++ agent.h (working copy) @@ -31,7 +31,7 @@ extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad *mad, +extern int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1180) +++ agent_priv.h (working copy) @@ -33,7 +33,7 @@ struct ib_agent_send_wr { struct list_head send_list; struct ib_ah *ah; - struct ib_mad *mad; + struct ib_mad_private *mad; DECLARE_PCI_UNMAP_ADDR(mapping) }; Index: agent.c =================================================================== --- agent.c (revision 1182) +++ agent.c (working copy) @@ -33,7 +33,9 @@ static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); +extern kmem_cache_t *ib_mad_cache; + static inline struct ib_agent_port_private * __ib_get_agent_port(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -95,7 +97,7 @@ static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent_port_private *port_priv, - struct ib_mad *mad, + struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc) { @@ -114,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, - mad, - sizeof(struct ib_mad), + &mad->grh, + sizeof *mad - sizeof mad->header, PCI_DMA_TODEVICE); - gather_list.length = sizeof(struct ib_mad); + gather_list.length = sizeof *mad - sizeof mad->header; gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -133,7 +135,7 @@ ah_attr.src_path_bits = wc->dlid_path_bits; ah_attr.sl = wc->sl; ah_attr.static_rate = 0; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; /* Should sgid be looked up ? 
*/ @@ -162,14 +164,14 @@ } send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } - send_wr.wr.ud.mad_hdr = (struct ib_mad_hdr *)mad; + send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; send_wr.wr_id = ++port_priv->wr_id; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -180,7 +182,8 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -195,7 +198,7 @@ return ret; } -int agent_send(struct ib_mad *mad, +int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, @@ -212,7 +215,7 @@ } /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad_hdr.mgmt_class) { + switch (mad->mad.mad.mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: mad_agent = port_priv->dr_smp_agent; break; @@ -269,13 +272,14 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); /* Release allocated memory */ - kfree(agent_send_wr->mad); + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); kfree(agent_send_wr); } Index: mad.c =================================================================== --- mad.c (revision 1181) +++ mad.c (working copy) @@ -69,7 +69,7 @@ MODULE_AUTHOR("Sean Hefty"); -static kmem_cache_t *ib_mad_cache; +kmem_cache_t *ib_mad_cache; static struct list_head ib_mad_port_list; static u32 ib_mad_client_id = 0; @@ -83,7 +83,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, @@ -1067,12 +1068,17 @@ { struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_private *recv; + struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; struct ib_smp *smp; int solicited; + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; dequeue_mad(mad_list); @@ -1119,11 +1125,9 @@ /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { - struct ib_mad *response; struct ib_grh *grh; int ret; - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); if (!response) { printk(KERN_ERR PFX "No memory for response MAD\n"); /* @@ -1137,32 
+1141,29 @@ port_priv->port_num, wc->slid, recv->header.recv_buf.mad, - response); + &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad_hdr.mgmt_class == + if (response->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, + (struct ib_smp *)&response->mad.mad, port_priv->device->node_type, port_priv->port_num, port_priv->device->phys_port_cnt)) { - kfree(response); goto out; } } /* Send response */ grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); - if (agent_send(response, grh, wc, - port_priv->device, - port_priv->port_num)) { - kfree(response); - } + if (!agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; goto out; } - } else - kfree(response); + } } /* Determine corresponding MAD agent for incoming receive MAD */ @@ -1183,7 +1184,7 @@ kmem_cache_free(ib_mad_cache, recv); /* Post another receive request for this QP */ - ib_mad_post_receive_mad(qp_info); + ib_mad_post_receive_mad(qp_info, response); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1491,7 +1492,8 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; @@ -1499,19 +1501,23 @@ struct ib_recv_wr *bad_recv_wr; int ret; - /* - * Allocate memory for receive buffer. - * This is for both MAD and private header - * which contains the receive tracking structure. - * By prepending this header, there is one rather - * than two memory allocations. - */ - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - printk(KERN_ERR PFX "No memory for receive buffer\n"); - return -ENOMEM; + if (mad) + mad_priv = mad; + else { + /* + * Allocate memory for receive buffer. + * This is for both MAD and private header + * which contains the receive tracking structure. + * By prepending this header, there is one rather + * than two memory allocations. + */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + return -ENOMEM; + } } /* Setup scatter list */ @@ -1559,7 +1565,7 @@ int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - ret = ib_mad_post_receive_mad(qp_info); + ret = ib_mad_post_receive_mad(qp_info, NULL); if (ret) { printk(KERN_ERR PFX "receive post %d failed " "on %s port %d\n", i + 1, From mshefty at ichips.intel.com Tue Nov 9 09:00:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 09:00:54 -0800 Subject: [openib-general] MAD agent code comments In-Reply-To: <1100009558.13933.3.camel@localhost.localdomain> References: <41900EEB.7050109@ichips.intel.com> <1100009558.13933.3.camel@localhost.localdomain> Message-ID: <4190F7C6.8020509@ichips.intel.com> Hal Rosenstock wrote: > Since the agent does not use solicited sends, are its sends completed in > order (so this is only an issue for clients using solicited sends) ? I would think that solicited sends (i.e. responses) would be easier to maintain order, since those wouldn't have a timeout. But my preference would be to not defined the API this way. It makes queuing for QP overrun and error handling difficult. 
For example, a client posts 2 sends, both of which get queued. If the first send gets posted, but the second send fails when posting to the QP, then we'd need to delay reporting the second send's completion. This also makes it more difficult to go to multi-threaded completion handling, if that were shown to be beneficial. - Sean From halr at voltaire.com Tue Nov 9 09:07:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:07:56 -0500 Subject: [openib-general] More on IPoIB Multicast Message-ID: <1100020075.7342.1.camel@hpc-1> Hi Roland, If a multicast send is attempted and the node is not joined to the multicast group which is the destination of the send, a send only join (which is neutered due to lack of SM support) is assumed. Is my understanding correct ? Linux also supports multicast routing. For this case, I think a non member join is needed. I'm not sure how to detect which of the join cases to use. Also, for multicast routing, the multicast group created/removed traps would need to be subscribed to (and the SM would need to support these). Does anyone know if OpenSM does support this ? -- Hal From roland at topspin.com Tue Nov 9 09:07:06 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 09:07:06 -0800 Subject: [openib-general] Re: More on IPoIB Multicast In-Reply-To: <1100020075.7342.1.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 12:07:56 -0500") References: <1100020075.7342.1.camel@hpc-1> Message-ID: <52r7n37xz9.fsf@topspin.com> Hal> Hi Roland, If a multicast send is attempted and the node is Hal> not joined to the multicast group which is the destination of Hal> the send, a send only join (which is neutered due to lack of Hal> SM support) is assumed. Is my understanding correct ? Yes. Hal> Linux also supports multicast routing. For this case, I think Hal> a non member join is needed. I'm not sure how to detect which Hal> of the join cases to use. Hal> Also, for multicast routing, the multicast group Hal> created/removed traps would need to be subscribed to (and the Hal> SM would need to support these). Someone who understands how the kernel does multicast routing would have to guide us here. My goal is to get basic IPv4 cleaned up to the point I feel comfortable submitting upstream. However I'm very happy to have other people look at IPv6, multicast routing, multiport bonding/failover (although my feeling is that it would be better to extend the existing bonding driver rather than trying to put this in the IPoIB driver), .... - R. 
From mshefty at ichips.intel.com Tue Nov 9 09:11:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 09:11:39 -0800 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases In-Reply-To: <1100008896.8714.2105.camel@hpc-1> References: <1100008896.8714.2105.camel@hpc-1> Message-ID: <4190FA4B.5000000@ichips.intel.com> Hal Rosenstock wrote: > mad: In ib_mad_recv_done_handler, don't dispatch additional error cases > + if (ret & IB_MAD_RESULT_SUCCESS) { > + if (ret & IB_MAD_RESULT_REPLY) { > + if (response->mad_hdr.mgmt_class == > + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { > + if (!smi_handle_dr_smp_recv( > + (struct ib_smp *)response, > + port_priv->device->node_type, > + port_priv->port_num, > + port_priv->device->phys_port_cnt)) { > + kfree(response); > + goto out; > + } > + } > + /* Send response */ > + grh = (void *)recv->header.recv_buf.mad - > + sizeof(struct ib_grh); > + if (agent_send(response, grh, wc, > + port_priv->device, > + port_priv->port_num)) { > kfree(response); > } > goto out; > } goto out; I guess I was wondering if it was okay to move "goto out" to here, and always skip dispatching if process_mad returned success. I think dispatching in the failure case makes sense. > } else From mshefty at ichips.intel.com Tue Nov 9 09:16:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 09:16:33 -0800 Subject: [openib-general] Re: IPoIB Completion Handling In-Reply-To: <1100016345.13933.230.camel@localhost.localdomain> References: <1100010867.26166.7.camel@hpc-1> <52hdnz9hz2.fsf@topspin.com> <524qjz9goh.fsf@topspin.com> <1100016345.13933.230.camel@localhost.localdomain> Message-ID: <4190FB71.1090704@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: > >>By the way, reposting the receives is not the right thing to do on >>error -- the QP will be in the error state, so any new work requests >>will just complete with a flush status. We need to reset the QP and >>start over to recover from errors. > > > Is the same thing true for QP0/1 ? If so, this needs to be done there as > well. (There used to be a port restart there but this was excised). Btw, I have plans to get to this shortly. I have the send queuing code complete (need to re-merge after the patches this morning), but I haven't been able to debug the code yet. I'm running into some issues configuring a point-to-point "fabric", with opensm running on the sourceforge stack on the other node. I have some changes to handle send queuing that are needed when recovering from QP errors as well. 
- Sean From halr at voltaire.com Tue Nov 9 09:18:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:18:39 -0500 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases In-Reply-To: <4190FA4B.5000000@ichips.intel.com> References: <1100008896.8714.2105.camel@hpc-1> <4190FA4B.5000000@ichips.intel.com> Message-ID: <1100020719.13933.332.camel@localhost.localdomain> On Tue, 2004-11-09 at 12:11, Sean Hefty wrote: > Hal Rosenstock wrote: > > > mad: In ib_mad_recv_done_handler, don't dispatch additional error cases > > + if (ret & IB_MAD_RESULT_SUCCESS) { > > + if (ret & IB_MAD_RESULT_REPLY) { > > + if (response->mad_hdr.mgmt_class == > > + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { > > + if (!smi_handle_dr_smp_recv( > > + (struct ib_smp *)response, > > + port_priv->device->node_type, > > + port_priv->port_num, > > + port_priv->device->phys_port_cnt)) { > > + kfree(response); > > + goto out; > > + } > > + } > > + /* Send response */ > > + grh = (void *)recv->header.recv_buf.mad - > > + sizeof(struct ib_grh); > > + if (agent_send(response, grh, wc, > > + port_priv->device, > > + port_priv->port_num)) { > > kfree(response); > > } > > goto out; > > } > > goto out; > > I guess I was wondering if it was okay to move "goto out" to here, and > always skip dispatching if process_mad returned success. I think > dispatching in the failure case makes sense. Yes (more than OK, it's better :-) I'll issue a patch for this shortly. -- Hal From halr at voltaire.com Tue Nov 9 09:33:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:33:02 -0500 Subject: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch in additional case Message-ID: <1100021581.10808.1.camel@hpc-1> mad: In ib_mad_recv_done_handler, don't dispatch in additional case Index: mad.c =================================================================== --- mad.c (revision 1183) +++ mad.c (working copy) @@ -1161,8 +1161,8 @@ port_priv->device, port_priv->port_num)) response = NULL; - goto out; } + goto out; } } From halr at voltaire.com Tue Nov 9 09:30:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 12:30:50 -0500 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] Message-ID: <1100021450.13933.339.camel@localhost.localdomain> One more thing on this I forgot to post: As I am not yet set up with Kegel cross tools (and don't have a machine where the pci_ macros are non trivial), I would appreciate it if someone could verify these changes (or latest code) on some architecture where the pci_ macros are non trivial. Thanks. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy Date: 09 Nov 2004 11:49:21 -0500 mad/agent: Modify receive buffer allocation strategy (Inefficiency pointed out by Sean; algorithm described by Roland) Problem: Currently, if the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Solution: The MAD layer can allocate a response MAD when a MAD is received, and if the process_mad call doesn't actually generate a response the MAD layer just stashes the response MAD away to use for the next receive. This should keep the number of allocations within 1 of the number of responses actually generated, but save us from tracking allocations between two layers. 
Index: agent.h =================================================================== --- agent.h (revision 1180) +++ agent.h (working copy) @@ -31,7 +31,7 @@ extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad *mad, +extern int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1180) +++ agent_priv.h (working copy) @@ -33,7 +33,7 @@ struct ib_agent_send_wr { struct list_head send_list; struct ib_ah *ah; - struct ib_mad *mad; + struct ib_mad_private *mad; DECLARE_PCI_UNMAP_ADDR(mapping) }; Index: agent.c =================================================================== --- agent.c (revision 1182) +++ agent.c (working copy) @@ -33,7 +33,9 @@ static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(ib_agent_port_list); +extern kmem_cache_t *ib_mad_cache; + static inline struct ib_agent_port_private * __ib_get_agent_port(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) @@ -95,7 +97,7 @@ static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent_port_private *port_priv, - struct ib_mad *mad, + struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc) { @@ -114,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, - mad, - sizeof(struct ib_mad), + &mad->grh, + sizeof *mad - sizeof mad->header, PCI_DMA_TODEVICE); - gather_list.length = sizeof(struct ib_mad); + gather_list.length = sizeof *mad - sizeof mad->header; gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -133,7 +135,7 @@ ah_attr.src_path_bits = wc->dlid_path_bits; ah_attr.sl = wc->sl; ah_attr.static_rate = 0; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; /* Should sgid be looked up ? 
*/ @@ -162,14 +164,14 @@ } send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } - send_wr.wr.ud.mad_hdr = (struct ib_mad_hdr *)mad; + send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; send_wr.wr_id = ++port_priv->wr_id; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -180,7 +182,8 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -195,7 +198,7 @@ return ret; } -int agent_send(struct ib_mad *mad, +int agent_send(struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, @@ -212,7 +215,7 @@ } /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad_hdr.mgmt_class) { + switch (mad->mad.mad.mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: mad_agent = port_priv->dr_smp_agent; break; @@ -269,13 +272,14 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); /* Release allocated memory */ - kfree(agent_send_wr->mad); + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); kfree(agent_send_wr); } Index: mad.c =================================================================== --- mad.c (revision 1181) +++ mad.c (working copy) @@ -69,7 +69,7 @@ MODULE_AUTHOR("Sean Hefty"); -static kmem_cache_t *ib_mad_cache; +kmem_cache_t *ib_mad_cache; static struct list_head ib_mad_port_list; static u32 ib_mad_client_id = 0; @@ -83,7 +83,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, @@ -1067,12 +1068,17 @@ { struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_private *recv; + struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; struct ib_smp *smp; int solicited; + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; dequeue_mad(mad_list); @@ -1119,11 +1125,9 @@ /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { - struct ib_mad *response; struct ib_grh *grh; int ret; - response = kmalloc(sizeof(struct ib_mad), GFP_KERNEL); if (!response) { printk(KERN_ERR PFX "No memory for response MAD\n"); /* @@ -1137,32 
+1141,29 @@ port_priv->port_num, wc->slid, recv->header.recv_buf.mad, - response); + &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad_hdr.mgmt_class == + if (response->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { if (!smi_handle_dr_smp_recv( - (struct ib_smp *)response, + (struct ib_smp *)&response->mad.mad, port_priv->device->node_type, port_priv->port_num, port_priv->device->phys_port_cnt)) { - kfree(response); goto out; } } /* Send response */ grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); - if (agent_send(response, grh, wc, - port_priv->device, - port_priv->port_num)) { - kfree(response); - } + if (!agent_send(response, grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; goto out; } - } else - kfree(response); + } } /* Determine corresponding MAD agent for incoming receive MAD */ @@ -1183,7 +1184,7 @@ kmem_cache_free(ib_mad_cache, recv); /* Post another receive request for this QP */ - ib_mad_post_receive_mad(qp_info); + ib_mad_post_receive_mad(qp_info, response); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1491,7 +1492,8 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; @@ -1499,19 +1501,23 @@ struct ib_recv_wr *bad_recv_wr; int ret; - /* - * Allocate memory for receive buffer. - * This is for both MAD and private header - * which contains the receive tracking structure. - * By prepending this header, there is one rather - * than two memory allocations. - */ - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - printk(KERN_ERR PFX "No memory for receive buffer\n"); - return -ENOMEM; + if (mad) + mad_priv = mad; + else { + /* + * Allocate memory for receive buffer. + * This is for both MAD and private header + * which contains the receive tracking structure. + * By prepending this header, there is one rather + * than two memory allocations. + */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? 
+ GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + return -ENOMEM; + } } /* Setup scatter list */ @@ -1559,7 +1565,7 @@ int i, ret; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - ret = ib_mad_post_receive_mad(qp_info); + ret = ib_mad_post_receive_mad(qp_info, NULL); if (ret) { printk(KERN_ERR PFX "receive post %d failed " "on %s port %d\n", i + 1, _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From libor at topspin.com Tue Nov 9 09:55:13 2004 From: libor at topspin.com (Libor Michalek) Date: Tue, 9 Nov 2004 09:55:13 -0800 Subject: [openib-general] VAPI_RETRY_EXC_ERR In-Reply-To: <4A388685F814D54CAE412B2DAB7CE91C195455@initexch.topspincom.com>; from sreenivasulu@topspin.com on Tue, Nov 09, 2004 at 04:19:17PM +0530 References: <4A388685F814D54CAE412B2DAB7CE91C195455@initexch.topspincom.com> Message-ID: <20041109095513.A30186@topspin.com> On Tue, Nov 09, 2004 at 04:19:17PM +0530, Sreenivasulu Pulichintala wrote: > -----Original Message----- > From: Sreenivasulu Pulichintala > Sent: Tuesday, November 09, 2004 3:56 PM > To: openib-general at openib.org > Subject: [openib-general] VAPI_RETRY_EXC_ERR > > HI, > > I use MPICH 1.2.5 and MVAPICH 0.9.2 stack and when I run some of my > fortran applications, some times my application crashes producing the > following error - > > Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 Of the possible issues that Tziporet lists, the most likely problem with MVAPICH 0.9.2 is that the local ack timeout is too small for either large or blocking clusters. It is currently set to 10 (DEFAULT_ACK_TIMEOUT) which translates to 4 milliseconds. (IBTA spec section 9.9.2) I would try a value such as 15 or 20... Also the retry counter is set using the define DEFAULT_RETRY_COUNT in the MVAPICH source. It's currently set to 5. -Libor From roland at topspin.com Tue Nov 9 11:37:33 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 11:37:33 -0800 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <1100021450.13933.339.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 12:30:50 -0500") References: <1100021450.13933.339.camel@localhost.localdomain> Message-ID: <52is8e7r0i.fsf@topspin.com> Hal> One more thing on this I forgot to post: As I am not yet set Hal> up with Kegel cross tools (and don't have a machine where the Hal> pci_ macros are non trivial), I would appreciate it if Hal> someone could verify these changes (or latest code) on some Hal> architecture where the pci_ macros are non trivial. It builds fine on all the architectures I test but (with r1184) the SMA doesn't seem to be working (port stays in INIT state). I see the port_rcv_data counter going up so I know the SM is sweeping. On i386 I don't see anything in the log, and on ppc64 I see a stream of: Invalid directed route in the kernel log. - R. 
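A note on the arithmetic in Libor's ack-timeout suggestion above: the local ack timeout programmed into the QP is a 5-bit exponent, and per IBTA spec section 9.9.2 the resulting wait is 4.096 usec * 2^timeout. The sketch below is illustration only; ack_timeout_usec() is a hypothetical helper written for this note, not code from MVAPICH or the gen2 tree.

/* Illustration only: convert the 5-bit Local ACK Timeout exponent
 * into a wall-clock wait, per IBTA spec 9.9.2. */
#include <stdio.h>

static double ack_timeout_usec(unsigned int timeout_exp)
{
	return 4.096 * (double)(1u << timeout_exp);
}

int main(void)
{
	/* DEFAULT_ACK_TIMEOUT == 10 -> ~4.2 ms (the ~4 ms Libor cites);
	 * 15 -> ~134 ms; 20 -> ~4.3 s. */
	unsigned int exps[] = { 10, 15, 20 };
	unsigned int i;

	for (i = 0; i < 3; i++)
		printf("timeout=%u -> %.1f ms\n", exps[i],
		       ack_timeout_usec(exps[i]) / 1000.0);
	return 0;
}

Each +1 on the exponent doubles the wait, so raising the value from 10 to 15 gives the HCA 32 times longer before a retry is declared, which is why it helps on large or blocking (oversubscribed) fabrics.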
From roland at topspin.com Tue Nov 9 11:39:16 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 11:39:16 -0800 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <52is8e7r0i.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 11:37:33 -0800") References: <1100021450.13933.339.camel@localhost.localdomain> <52is8e7r0i.fsf@topspin.com> Message-ID: <52ekj27qxn.fsf@topspin.com> By the way, we probably want this applied: Index: core/mad.c =================================================================== --- core/mad.c (revision 1184) +++ core/mad.c (working copy) @@ -385,7 +385,7 @@ mad_agent->device->node_type, mad_agent->port_num)) { ret = -EINVAL; - printk(KERN_ERR "Invalid directed route\n"); + printk(KERN_ERR PFX "Invalid directed route\n"); goto error1; } if (smi_check_local_dr_smp(smp, From mshefty at ichips.intel.com Tue Nov 9 11:40:14 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 11:40:14 -0800 Subject: [openib-general] error trying to bring up node Message-ID: <41911D1E.10608@ichips.intel.com> I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... - Sean From mshefty at ichips.intel.com Tue Nov 9 11:56:47 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 11:56:47 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <41911D1E.10608@ichips.intel.com> References: <41911D1E.10608@ichips.intel.com> Message-ID: <419120FF.7010607@ichips.intel.com> Sean Hefty wrote: > I have two nodes directly connected. When trying to bring up the openib > node, I receive a local length error on the CQ after trying to perform a > send. > > I'm continuing to debug... static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent_port_private *port_priv, struct ib_mad_private *mad, struct ib_grh *grh, struct ib_wc *wc) { ... /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, &mad->grh, sizeof *mad - sizeof mad->header, PCI_DMA_TODEVICE); gather_list.length = sizeof *mad - sizeof mad->header; gather_list.lkey = (*port_priv->mr).lkey; Wouldn't this result in sending the GRH data buffer before the MAD buffer? Does mthca check the size of sends that are posted to QP0/1 and report an error if they are larger than 256 bytes? - Sean From halr at voltaire.com Tue Nov 9 12:19:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:19:51 -0500 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <52ekj27qxn.fsf@topspin.com> References: <1100021450.13933.339.camel@localhost.localdomain> <52is8e7r0i.fsf@topspin.com> <52ekj27qxn.fsf@topspin.com> Message-ID: <1100031591.13933.349.camel@localhost.localdomain> On Tue, 2004-11-09 at 14:39, Roland Dreier wrote: > By the way, we probably want this applied: Thanks. Applied. -- Hal From roland at topspin.com Tue Nov 9 12:25:55 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:25:55 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <419120FF.7010607@ichips.intel.com> (Sean Hefty's message of "Tue, 09 Nov 2004 11:56:47 -0800") References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> Message-ID: <52zn1q6a7g.fsf@topspin.com> Sean> Wouldn't this result in sending the GRH data buffer before Sean> the MAD buffer? 
Yes, it sure looks that way. Sean> Does mthca check the size of sends that are Sean> posted to QP0/1 and report an error if they are larger than Sean> 256 bytes? No, it will probably send it. (And cause a problem on the receive side) - R. From halr at voltaire.com Tue Nov 9 12:24:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:24:16 -0500 Subject: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy] In-Reply-To: <52is8e7r0i.fsf@topspin.com> References: <1100021450.13933.339.camel@localhost.localdomain> <52is8e7r0i.fsf@topspin.com> Message-ID: <1100031856.13933.353.camel@localhost.localdomain> On Tue, 2004-11-09 at 14:37, Roland Dreier wrote: > Hal> One more thing on this I forgot to post: As I am not yet set > Hal> up with Kegel cross tools (and don't have a machine where the > Hal> pci_ macros are non trivial), I would appreciate it if > Hal> someone could verify these changes (or latest code) on some > Hal> architecture where the pci_ macros are non trivial. > > It builds fine on all the architectures I test but (with r1184) the > SMA doesn't seem to be working (port stays in INIT state). I see the > port_rcv_data counter going up so I know the SM is sweeping. On i386 > I don't see anything in the log, and on ppc64 I see a stream of: > > Invalid directed route > > in the kernel log. In smi.c, smi_handle_dr_smp_send is indicating this packet is invalid for some reason. What are the hop_cnt and hop_ptr in the outgoing SMP ? Is your configuration the same as Sean's (back to back HCAs) ? Thanks. -- Hal From halr at voltaire.com Tue Nov 9 12:25:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:25:37 -0500 Subject: [openib-general] error trying to bring up node In-Reply-To: <419120FF.7010607@ichips.intel.com> References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> Message-ID: <1100031937.13933.355.camel@localhost.localdomain> On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: > Sean Hefty wrote: > > > I have two nodes directly connected. When trying to bring up the openib > > node, I receive a local length error on the CQ after trying to perform a > > send. > > > > I'm continuing to debug... > > static int agent_mad_send(struct ib_mad_agent *mad_agent, > struct ib_agent_port_private *port_priv, > struct ib_mad_private *mad, > struct ib_grh *grh, > struct ib_wc *wc) > { > ... > /* PCI mapping */ > gather_list.addr = pci_map_single(mad_agent->device->dma_device, > &mad->grh, > sizeof *mad - > sizeof mad->header, > PCI_DMA_TODEVICE); > gather_list.length = sizeof *mad - sizeof mad->header; > gather_list.lkey = (*port_priv->mr).lkey; > > > Wouldn't this result in sending the GRH data buffer before the MAD > buffer? Does mthca check the size of sends that are posted to QP0/1 and > report an error if they are larger than 256 bytes? Doesn't that just map starting at the GRH ? This is to handle PMA responses which might have GRHs. -- Hal From roland at topspin.com Tue Nov 9 12:30:48 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:30:48 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <1100031937.13933.355.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 15:25:37 -0500") References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> <1100031937.13933.355.camel@localhost.localdomain> Message-ID: <52u0ry69zb.fsf@topspin.com> Hal> Doesn't that just map starting at the GRH ? 
This is to handle Hal> PMA responses which might have GRHs. Sure, it maps starting at the GRH and uses that as the start of the gather segment used for the send (and tries to send more than 256 bytes). This is wrong even when sending a packet with GRH (the address vector has the global route information; you don't have to supply a GRH when posting the send). - Roland From mshefty at ichips.intel.com Tue Nov 9 12:32:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 12:32:55 -0800 Subject: [openib-general] error trying to bring up node In-Reply-To: <1100031937.13933.355.camel@localhost.localdomain> References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> <1100031937.13933.355.camel@localhost.localdomain> Message-ID: <41912977.50806@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: > >>Sean Hefty wrote: >> >> >>>I have two nodes directly connected. When trying to bring up the openib >>>node, I receive a local length error on the CQ after trying to perform a >>>send. >>> >>>I'm continuing to debug... >> >>static int agent_mad_send(struct ib_mad_agent *mad_agent, >> struct ib_agent_port_private *port_priv, >> struct ib_mad_private *mad, >> struct ib_grh *grh, >> struct ib_wc *wc) >>{ >>... >> /* PCI mapping */ >> gather_list.addr = pci_map_single(mad_agent->device->dma_device, >> &mad->grh, >> sizeof *mad - >> sizeof mad->header, >> PCI_DMA_TODEVICE); >> gather_list.length = sizeof *mad - sizeof mad->header; >> gather_list.lkey = (*port_priv->mr).lkey; >> >> >>Wouldn't this result in sending the GRH data buffer before the MAD >>buffer? Does mthca check the size of sends that are posted to QP0/1 and >>report an error if they are larger than 256 bytes? > > > Doesn't that just map starting at the GRH ? This is to handle PMA > responses which might have GRHs. It does. But the GRH buffer shouldn't be sent by the user. My thought was the this would result in the receiver mis-interpreting the received MAD, and probably dropping it. But I'm seeing that the work request completes in error, which makes me think that there's still another error somewhere. - Sean From halr at voltaire.com Tue Nov 9 12:30:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:30:29 -0500 Subject: [openib-general] error trying to bring up node In-Reply-To: <1100031937.13933.355.camel@localhost.localdomain> References: <41911D1E.10608@ichips.intel.com> <419120FF.7010607@ichips.intel.com> <1100031937.13933.355.camel@localhost.localdomain> Message-ID: <1100032229.13933.360.camel@localhost.localdomain> On Tue, 2004-11-09 at 15:25, Hal Rosenstock wrote: > Doesn't that just map starting at the GRH ? This is to handle PMA > responses which might have GRHs. Never mind. I see the problem. 
-- Hal From halr at voltaire.com Tue Nov 9 12:49:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:49:34 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length Message-ID: <1100033372.17687.3.camel@hpc-1> agent: Fix agent_mad_send PCI mapping and gather address and length Index: agent.c =================================================================== --- agent.c (revision 1183) +++ agent.c (working copy) @@ -116,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, - &mad->grh, - sizeof *mad - sizeof mad->header, + &mad->mad, + sizeof(struct ib_mad), PCI_DMA_TODEVICE); - gather_list.length = sizeof *mad - sizeof mad->header; + gather_list.length = sizeof(struct ib_mad); gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -272,8 +272,7 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), + sizeof(struct ib_mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); From roland at topspin.com Tue Nov 9 12:50:30 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:50:30 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100033372.17687.3.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 15:49:34 -0500") References: <1100033372.17687.3.camel@hpc-1> Message-ID: <52bre6692h.fsf@topspin.com> OK, this works on my i386 system but I'm still getting ib_mad: Invalid directed route on ppc64. I'll try to debug what exactly is happening (ie put some prints in to see why smi_handle_dr_smp_send() is rejecting it). - R. From roland at topspin.com Tue Nov 9 12:53:13 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 12:53:13 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52bre6692h.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 12:50:30 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> Message-ID: <527jou68xy.fsf@topspin.com> Roland> OK, this works on my i386 system but I'm still getting Roland> ib_mad: Invalid directed route Roland> on ppc64. I'll try to debug what exactly is happening (ie Roland> put some prints in to see why smi_handle_dr_smp_send() is Roland> rejecting it). By the way, the i386 system is connected directly to the switch running the SM, while the ppc64 system is a few hops away. So it's just as likely to be a DR SMI handling problem as a ppc64 architecture issue. - R. From halr at voltaire.com Tue Nov 9 12:55:43 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 15:55:43 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <527jou68xy.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> Message-ID: <1100033742.2170.11.camel@localhost.localdomain> On Tue, 2004-11-09 at 15:53, Roland Dreier wrote: > By the way, the i386 system is connected directly to the switch > running the SM, That's the config I run in too. > while the ppc64 system is a few hops away. I think Sean's original config was a couple of hops. > So it's > just as likely to be a DR SMI handling problem as a ppc64 architecture > issue. 
My money's on a DR SMI issue :-) -- Hal From root at DYN318430BLD.linux.local Tue Nov 9 13:32:15 2004 From: root at DYN318430BLD.linux.local (root) Date: Tue, 9 Nov 2004 13:32:15 -0800 (PST) Subject: [openib-general] [PATCH] Unnecessary initialization of sa_query in failure case. In-Reply-To: <52sm7kb0zk.fsf@topspin.com> Message-ID: diff -ruNp org/core/sa_query.c new/core/sa_query.c --- org/core/sa_query.c 2004-11-09 12:51:35.000000000 -0800 +++ new/core/sa_query.c 2004-11-09 13:30:38.000000000 -0800 @@ -547,7 +547,6 @@ int ib_sa_path_rec_get(struct ib_device *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { - *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); } @@ -623,7 +622,6 @@ int ib_sa_mcmember_rec_query(struct ib_d *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { - *sa_query = NULL; kfree(query->sa_query.mad); kfree(query); } From roland at topspin.com Tue Nov 9 15:03:00 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 15:03:00 -0800 Subject: [openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case. In-Reply-To: (root@dyn318430bld.linux.local's message of "Tue, 9 Nov 2004 13:32:15 -0800 (PST)") References: Message-ID: <52pt2m4od7.fsf@topspin.com> Why is this initialization unnecessary? If we delete these lines, isn't sa_query left pointing to invalid memory when a send fails? - R. From root at DYN318430BLD.linux.local Tue Nov 9 14:06:47 2004 From: root at DYN318430BLD.linux.local (root) Date: Tue, 9 Nov 2004 14:06:47 -0800 (PST) Subject: [openib-general] Question on handle_outgoing_smp Message-ID: In the following code: if (smi_check_local_dr_smp(smp, mad_agent->device, mad_agent->port_num)) { ... ret = mad_agent->device->process_mad( mad_agent->device, 0, mad_agent->port_num, smp->dr_slid, /* ? */ (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); How do we guarantee that process_mad() was supplied (not NULL)? That is, what if smi_check_local_smp didn't get called via smi_check_local_dr_smp? thx, - KK From krkumar at us.ibm.com Tue Nov 9 15:31:44 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 9 Nov 2004 15:31:44 -0800 (PST) Subject: [openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case. In-Reply-To: <52pt2m4od7.fsf@topspin.com> Message-ID: On Tue, 9 Nov 2004, Roland Dreier wrote: > Why is this initialization unnecessary? If we delete these lines, isn't > sa_query left pointing to invalid memory when a send fails? Because ULPs should not use a pointer that is set in the callee routine if the call failed. In this case, path_rec_start and unicast_arp_start should not use "query" if the call failed. And "query" is a stack variable in those routines, so it won't hang around too long :-) thanks, - KK From tduffy at sun.com Tue Nov 9 15:49:35 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 09 Nov 2004 15:49:35 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <20041109232308.607B72283D4@openib.ca.sandia.gov> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> Message-ID: <1100044175.12438.3.camel@duffman> On Tue, 2004-11-09 at 15:23 -0800, halr at openib.org wrote: > Author: halr > Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) > New Revision: 1186 > > Modified: > gen2/trunk/src/linux-kernel/infiniband/core/agent.c > Log: > Fix agent_mad_send PCI mapping and gather address and length Please revert this change.
It seems to break x86_64 as well, at least in my setup. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Nov 9 15:54:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 15:54:07 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100033742.2170.11.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 09 Nov 2004 15:55:43 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> Message-ID: <52llda4m00.fsf@topspin.com> OK, I think I understand the problem, but I'm not sure what the correct solution is. When a DR SMP arrives at a CA from the SM, hop_cnt == hop_ptr == number of hops in the directed route, and somehow they are not updated correctly by the time the response reaches handle_outgoing_smp(). I can't follow the code well enough to understand why all DR SMPs have to go through both smi_handle_dr_smp_recv() and smi_handle_dr_smp_send() but the patch below seems to correct things for me (ports go to ACTIVE on all my systems). (handle_outgoing_smp() already calls smi_handle_dr_smp_recv() so it seems the response was getting passed to smi_handle_dr_smp_recv() twice). - R. Index: mad.c =================================================================== --- mad.c (revision 1186) +++ mad.c (working copy) @@ -1144,16 +1144,6 @@ &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad.mad.mad_hdr.mgmt_class == - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&response->mad.mad, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) { - goto out; - } - } /* Send response */ grh = (void *)recv->header.recv_buf.mad - sizeof(struct ib_grh); From Nitin.Hande at Sun.COM Tue Nov 9 15:55:45 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Tue, 09 Nov 2004 15:55:45 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <1100044175.12438.3.camel@duffman> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> Message-ID: <41915901.9000502@Sun.COM> Tom Duffy wrote: > On Tue, 2004-11-09 at 15:23 -0800, halr at openib.org wrote: > >>Author: halr >>Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) >>New Revision: 1186 >> >>Modified: >> gen2/trunk/src/linux-kernel/infiniband/core/agent.c >>Log: >>Fix agent_mad_send PCI mapping and gather address and length > > > Please revert this change. It seems to break x86_64 as well, at least > in my setup. certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? 
Thanks Nitin > > -tduffy > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Tue Nov 9 16:01:15 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 16:01:15 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <41915901.9000502@Sun.COM> (Nitin Hande's message of "Tue, 09 Nov 2004 15:55:45 -0800") References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> Message-ID: <528y9a4lo4.fsf@topspin.com> Nitin> certainly it does break my x86_64 setup too. Can we revert Nitin> back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be able to use the patch I just posted to get things working again. Let's give Hal a chance to fix things up properly. - R. From mshefty at ichips.intel.com Tue Nov 9 16:08:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 16:08:46 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <528y9a4lo4.fsf@topspin.com> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> <528y9a4lo4.fsf@topspin.com> Message-ID: <41915C0E.9040807@ichips.intel.com> Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just posted to get things working again. Let's give Hal a chance to > fix things up properly. This patch just fixed the issues I was having as well, and I'm running with two systems directly connected. Thanks. - Sean From tduffy at sun.com Tue Nov 9 16:12:04 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 09 Nov 2004 16:12:04 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <528y9a4lo4.fsf@topspin.com> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> <528y9a4lo4.fsf@topspin.com> Message-ID: <1100045524.12438.14.camel@duffman> On Tue, 2004-11-09 at 16:01 -0800, Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just posted to get things working again. Let's give Hal a chance to > fix things up properly. OK, your patch got rid of the "ib_mad: Invalid directed route" message anyways. And my port is going to ACTIVE now. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Nitin.Hande at Sun.COM Tue Nov 9 16:11:52 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Tue, 09 Nov 2004 16:11:52 -0800 Subject: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core In-Reply-To: <528y9a4lo4.fsf@topspin.com> References: <20041109232308.607B72283D4@openib.ca.sandia.gov> <1100044175.12438.3.camel@duffman> <41915901.9000502@Sun.COM> <528y9a4lo4.fsf@topspin.com> Message-ID: <41915CC8.9070000@Sun.COM> Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just posted to get things working again. Let's give Hal a chance to > fix things up properly. > > - R. Applying your patch, I do not see the "redirect message" anymore. But I cannot ping the peer interface yet. I am on the x86_64 arch, btw. Unfortunately I gotta run, will debug more later tonight. Thanks Nitin From mshefty at ichips.intel.com Tue Nov 9 17:12:53 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Nov 2004 17:12:53 -0800 Subject: [openib-general] [PATCH] handle QP0/1 send queue overrun Message-ID: <41916B15.8050909@ichips.intel.com> The following patch adds support for handling QP0/1 send queue overrun, along with a couple of related fixes: * The patch includes the one provided by Roland in order to configure the fabric. * The code no longer modifies the user's send_wr structures when sending a MAD. * Work requests for sent MADs are copied in order to handle both queuing and error recovery (when added). * The receive side code was slightly restructured to use a single function to repost receives. If a receive cannot be posted for some reason (e.g. lack of memory), it will now try to refill the receive queue when posting an additional receive. (This will also make it possible for the code to be lazier about reposting receives, which would allow for better batching of completions.) Also, I switched my mailer, so I apologize in advance if I hose up my patch.
- Sean Index: core/agent.c =================================================================== --- core/agent.c (revision 1186) +++ core/agent.c (working copy) @@ -117,9 +117,9 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, &mad->mad, - sizeof(struct ib_mad), + sizeof mad->mad, PCI_DMA_TODEVICE); - gather_list.length = sizeof(struct ib_mad); + gather_list.length = sizeof mad->mad; gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -182,8 +182,7 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), + sizeof mad->mad, PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -272,7 +271,7 @@ /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof(struct ib_mad), + sizeof agent_send_wr->mad->mad, PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); Index: core/mad.c =================================================================== --- core/mad.c (revision 1186) +++ core/mad.c (working copy) @@ -83,9 +83,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, - struct ib_mad_private *mad); -static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); @@ -345,24 +344,11 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); -static void queue_mad(struct ib_mad_queue *mad_queue, - struct ib_mad_list_head *mad_list) -{ - unsigned long flags; - - mad_list->mad_queue = mad_queue; - spin_lock_irqsave(&mad_queue->lock, flags); - list_add_tail(&mad_list->list, &mad_queue->list); - mad_queue->count++; - spin_unlock_irqrestore(&mad_queue->lock, flags); -} - static void dequeue_mad(struct ib_mad_list_head *mad_list) { struct ib_mad_queue *mad_queue; unsigned long flags; - BUG_ON(!mad_list->mad_queue); mad_queue = mad_list->mad_queue; spin_lock_irqsave(&mad_queue->lock, flags); list_del(&mad_list->list); @@ -481,24 +467,35 @@ } static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_send_wr_private *mad_send_wr, - struct ib_send_wr *send_wr, - struct ib_send_wr **bad_send_wr) + struct ib_mad_send_wr_private *mad_send_wr) { struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; int ret; /* Replace user's WR ID with our own to find WR upon completion */ qp_info = mad_agent_priv->qp_info; - mad_send_wr->wr_id = send_wr->wr_id; - send_wr->wr_id = (unsigned long)&mad_send_wr->mad_list; - queue_mad(&qp_info->send_queue, &mad_send_wr->mad_list); + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; - ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); - if (ret) { - printk(KERN_NOTICE PFX "ib_post_send failed ret = %d\n", ret); - dequeue_mad(&mad_send_wr->mad_list); - *bad_send_wr = send_wr; + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if 
(qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; } return ret; } @@ -511,9 +508,8 @@ struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { - int ret; - struct ib_send_wr *cur_send_wr, *next_send_wr; - struct ib_mad_agent_private *mad_agent_priv; + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; /* Validate supplied parameters */ if (!bad_send_wr) @@ -522,6 +518,9 @@ if (!mad_agent || !send_wr ) goto error2; + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + if (!mad_agent->send_handler || (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) goto error2; @@ -531,30 +530,31 @@ agent); /* Walk list of send WRs and post each on send list */ - cur_send_wr = send_wr; - while (cur_send_wr) { + while (send_wr) { unsigned long flags; + struct ib_send_wr *next_send_wr; struct ib_mad_send_wr_private *mad_send_wr; struct ib_smp *smp; - if (!cur_send_wr->wr.ud.mad_hdr) { - *bad_send_wr = cur_send_wr; + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion. + */ + if (!send_wr->wr.ud.mad_hdr) { printk(KERN_ERR PFX "MAD header must be supplied " - "in WR %p\n", cur_send_wr); - goto error1; + "in WR %p\n", send_wr); + goto error2; } + next_send_wr = (struct ib_send_wr *)send_wr->next; - next_send_wr = (struct ib_send_wr *)cur_send_wr->next; - - smp = (struct ib_smp *)cur_send_wr->wr.ud.mad_hdr; + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = handle_outgoing_smp(mad_agent, smp, cur_send_wr); - if (ret < 0) { /* error */ - *bad_send_wr = cur_send_wr; - goto error1; - } else if (ret == 1) { /* locally consumed */ + ret = handle_outgoing_smp(mad_agent, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ goto next; - } } /* Allocate MAD send WR tracking structure */ @@ -562,16 +562,21 @@ (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : GFP_KERNEL); if (!mad_send_wr) { - *bad_send_wr = cur_send_wr; printk(KERN_ERR PFX "No memory for " "ib_mad_send_wr_private\n"); - return -ENOMEM; + ret = -ENOMEM; + goto error2; } + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; mad_send_wr->agent = mad_agent; /* Timeout will be updated after send completes */ - mad_send_wr->timeout = msecs_to_jiffies(cur_send_wr->wr. + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
ud.timeout_ms); /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); @@ -584,31 +589,24 @@ &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - cur_send_wr->next = NULL; - ret = ib_send_mad(mad_agent_priv, mad_send_wr, - cur_send_wr, bad_send_wr); + ret = ib_send_mad(mad_agent_priv, mad_send_wr); if (ret) { - /* Handle QP overrun separately... -ENOMEM */ - /* Handle posting when QP is in error state... */ - /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); list_del(&mad_send_wr->agent_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - atomic_dec(&mad_agent_priv->refcount); - return ret; + goto error2; } next: - cur_send_wr = next_send_wr; + send_wr = next_send_wr; } - return 0; error2: *bad_send_wr = send_wr; error1: - return -EINVAL; + return ret; } EXPORT_SYMBOL(ib_post_send_mad); @@ -1125,7 +1123,6 @@ /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { - struct ib_grh *grh; int ret; if (!response) { @@ -1144,20 +1141,8 @@ &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_REPLY) { - if (response->mad.mad.mad_hdr.mgmt_class == - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&response->mad.mad, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) { - goto out; - } - } /* Send response */ - grh = (void *)recv->header.recv_buf.mad - - sizeof(struct ib_grh); - if (!agent_send(response, grh, wc, + if (!agent_send(response, &recv->grh, wc, port_priv->device, port_priv->port_num)) response = NULL; @@ -1178,13 +1163,14 @@ */ recv = NULL; } - out: - if (recv) - kmem_cache_free(ib_mad_cache, recv); - /* Post another receive request for this QP */ - ib_mad_post_receive_mad(qp_info, response); + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); } static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) @@ -1291,16 +1277,51 @@ static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { - struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, mad_list); - dequeue_mad(mad_list); - /* Restore client wr_id in WC */ + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue. */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send. 
*/ wc->wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } } /* @@ -1492,88 +1513,74 @@ queue_work(port_priv->wq, &port_priv->work); } -static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info, - struct ib_mad_private *mad) +/* + * Allocate receive MADs and post receive WRs for them. + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) { + unsigned long flags; + int post, ret; struct ib_mad_private *mad_priv; struct ib_sge sg_list; - struct ib_recv_wr recv_wr; - struct ib_recv_wr *bad_recv_wr; - int ret; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; - if (mad) - mad_priv = mad; - else { - /* - * Allocate memory for receive buffer. - * This is for both MAD and private header - * which contains the receive tracking structure. - * By prepending this header, there is one rather - * than two memory allocations. - */ - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - printk(KERN_ERR PFX "No memory for receive buffer\n"); - return -ENOMEM; - } - } - - /* Setup scatter list */ - sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, - &mad_priv->grh, - sizeof *mad_priv - - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); + /* Initialize common scatter list fields. */ sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; sg_list.lkey = (*qp_info->port_priv->mr).lkey; - /* Setup receive WR */ + /* Initialize common receive WR fields. */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; - pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); - - /* Post receive WR. */ - queue_mad(&qp_info->recv_queue, &mad_priv->header.mad_list); - ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); - if (ret) { - dequeue_mad(&mad_priv->header.mad_list); - pci_unmap_single(qp_info->port_priv->device->dma_device, - pci_unmap_addr(&mad_priv->header, mapping), - sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); - - kmem_cache_free(ib_mad_cache, mad_priv); - printk(KERN_NOTICE PFX "ib_post_recv WRID 0x%Lx " - "failed ret = %d\n", - (unsigned long long) recv_wr.wr_id, ret); - return -EINVAL; - } - - return 0; -} -/* - * Allocate receive MADs and post receive WRs for them - */ -static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) -{ - int i, ret; - - for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - ret = ib_mad_post_receive_mad(qp_info, NULL); + do { + /* Allocate and map receive buffer. 
*/ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = pci_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + PCI_DMA_FROMDEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR. */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - printk(KERN_ERR PFX "receive post %d failed " - "on %s port %d\n", i + 1, - qp_info->port_priv->device->name, - qp_info->port_priv->port_num); + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + pci_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + PCI_DMA_FROMDEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: = %d\n", ret); break; } - } + } while (post); return ret; } @@ -1625,6 +1632,7 @@ spin_lock_irqsave(&qp_info->send_queue.lock, flags); INIT_LIST_HEAD(&qp_info->send_queue.list); qp_info->send_queue.count = 0; + INIT_LIST_HEAD(&qp_info->overflow_list); spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); } @@ -1789,7 +1797,7 @@ } for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_post_receive_mads(&port_priv->qp_info[i]); + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { printk(KERN_ERR PFX "Couldn't post receive " "requests\n"); @@ -1851,6 +1859,7 @@ qp_info->port_priv = port_priv; init_mad_queue(qp_info, &qp_info->send_queue); init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); memset(&qp_init_attr, 0, sizeof qp_init_attr); qp_init_attr.send_cq = port_priv->cq; @@ -1870,6 +1879,9 @@ ret = PTR_ERR(qp_info->qp); goto error; } + /* Use minimum queue sizes unless the CQ is resized. */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; return 0; error: Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1186) +++ core/mad_priv.h (working copy) @@ -122,6 +122,8 @@ struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; u64 wr_id; /* client WR ID */ u64 tid; unsigned long timeout; @@ -141,6 +143,7 @@ spinlock_t lock; struct list_head list; int count; + int max_active; struct ib_mad_qp_info *qp_info; }; @@ -149,7 +152,7 @@ struct ib_qp *qp; struct ib_mad_queue send_queue; struct ib_mad_queue recv_queue; - /* struct ib_mad_queue overflow_queue; */ + struct list_head overflow_list; }; struct ib_mad_port_private { -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: diffs URL: From halr at voltaire.com Tue Nov 9 19:26:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 09 Nov 2004 22:26:07 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52llda4m00.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> Message-ID: <1100057166.17621.23.camel@hpc-1> On Tue, 2004-11-09 at 18:54, Roland Dreier wrote: > OK, I think I understand the problem, but I'm not sure what the > correct solution is. When a DR SMP arrives at a CA from the SM, > hop_cnt == hop_ptr == number of hops in the directed route, What was the number ? > and somehow they are not updated correctly by the time the response > reaches handle_outgoing_smp(). > > I can't follow the code well enough to understand why all DR SMPs have > to go through both smi_handle_dr_smp_recv() and > smi_handle_dr_smp_send() but the patch below seems to correct things > for me (ports go to ACTIVE on all my systems). (handle_outgoing_smp() > already calls smi_handle_dr_smp_recv() so it seems the response was > getting passed to smi_handle_dr_smp_recv() twice). I integrated this patch and checked it back in. I don't think this is the solution for all cases (and something else is broken). The second call to smi_handle_dr_smp_recv was to validate the DR in the response packet before sending it. The response would be a returning DR packet (D bit 1). If hop_cnt == hop_ptr, I suspect this has been broken since r1163 (not including the other things I broke in it today). I will do some more work in understanding this tomorrow. -- Hal From roland at topspin.com Tue Nov 9 20:55:46 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 20:55:46 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100057166.17621.23.camel@hpc-1> (Hal Rosenstock's message of "Tue, 09 Nov 2004 22:26:07 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> Message-ID: <52r7n22tgt.fsf@topspin.com> Roland> OK, I think I understand the problem, but I'm not sure Roland> what the correct solution is. When a DR SMP arrives at a Roland> CA from the SM, hop_cnt == hop_ptr == number of hops in Roland> the directed route, Hal> What was the number ? For one port it was 4 and for another it was 6. It could really be anything (it's just how many hops away the SM is). Hal> I integrated this patch and checked it back in. I don't think Hal> this is the solution for all cases (and something else is Hal> broken). Could be. I had a hard time checking the code in smi.c (which is split between smi_handle_dr_smp_recv() and smi_handle_dr_smp_send() as well as smi_check_forward_dr_smp(), but which has outgoing and returning DR handling mixed together) against the IB spec (which splits outgoing and returning DR handling). Hal> The second call to smi_handle_dr_smp_recv was to validate the Hal> DR in the response packet before sending it. The response Hal> would be a returning DR packet (D bit 1). If hop_cnt == Hal> hop_ptr, I guess the problem with calling smi_handle_dr_smp_recv() twice on the same packet is that the function may alter the packet. - R. 
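To make the point about altering the packet concrete, here is a deliberately oversimplified sketch -- not the real smi.c, and ignoring the spec's full C14-* rule set -- of the shape of the directed-route receive check. The crucial property is that it is not a pure predicate: it advances routing state stored in the SMP itself.

/* Illustration only -- a stand-in for the hop-pointer handling in
 * smi_handle_dr_smp_recv(). For a returning SMP (D bit = 1) the hop
 * pointer walks back toward the requester as the packet is processed. */
struct dr_state_sketch {
	unsigned char hop_ptr;
	unsigned char hop_cnt;
};

static int sketch_dr_recv_step(struct dr_state_sketch *smp)
{
	if (smp->hop_ptr == 0)
		return 0;	/* no hops left to consume: invalid route */
	smp->hop_ptr--;		/* side effect: one hop of the route used up */
	return 1;
}

A validate-then-send flow that runs such a step once on receive and again before transmitting the reply consumes two hops for one actual hop; hop_ptr then no longer agrees with hop_cnt and the SMP is rejected as an invalid directed route, matching the symptom seen on the multi-hop ppc64 setup.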
From roland at topspin.com Tue Nov 9 21:55:43 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 09 Nov 2004 21:55:43 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52r7n22tgt.fsf@topspin.com> (Roland Dreier's message of "Tue, 09 Nov 2004 20:55:46 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> Message-ID: <52ekj22qow.fsf@topspin.com> It seems that MAD handling is still not quite right. It seems in my setup that IPoIB is not seeing the response to its MCMember set... (it does look like the query is reaching the SM) - R. From halr at voltaire.com Wed Nov 10 06:28:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 09:28:11 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52ekj22qow.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> Message-ID: <1100096891.801.25.camel@hpc-1> On Wed, 2004-11-10 at 00:55, Roland Dreier wrote: > It seems that MAD handling is still not quite right. It seems in my > setup that IPoIB is not seeing the response to its MCMember > set... (it does look like the query is reaching the SM) This is a separate issue from the ports not becoming active (the DR handling issue). I broke this part yesterday (not a good day at all :-( in r1181 and/or r1184, when I added what I thought was a correct change based on Sean's emails: not dispatching additional error cases in ib_mad_recv_done_handler. I then wrongly believed I had verified that things were still working. I can see now that this is wrong and have a fix for what stops IPoIB from working. The problem was that the response was received by the MAD layer but not dispatched, due to the change(s) noted above. So I am patching at least enough to get things operational for now. Please confirm that it works for you. I will not touch things until I hear that it does. Also, it seems to me that no response needs to be handed to process_mad. Does this optimization make sense ? Sorry for the temporary inconvenience. I will try not to do this again. It is no fun for anyone.
-- Hal mad: In ib_mad_recv_done_handler, if process_mad returns SUCCESS but not REPLY, received packet still needs to be dispatched Index: mad.c =================================================================== --- mad.c (revision 1187) +++ mad.c (working copy) @@ -1151,8 +1151,8 @@ port_priv->device, port_priv->port_num)) response = NULL; + goto out; } - goto out; } } From roland at topspin.com Wed Nov 10 07:36:29 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 07:36:29 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100096891.801.25.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 09:28:11 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> Message-ID: <52actp3ede.fsf@topspin.com> >>>>> "Hal" == Hal Rosenstock writes: Hal> I can see now that this is wrong and have a fix for what Hal> stops IPoIB from working. The problem was that the response Hal> was received by the MAD layer but not dispatched due to the Hal> change(s) noted above. Hal> So I am patching at least enough to get things operational Hal> for now. Please confirm that it works for you. I will not Hal> touch things until I hear that it does. Yes, IPoIB works for me again. Hal> Also, it seems to me that no response needs to be handed to Hal> process_mad. Does this optimization make sense ? I'm not sure I understand the question. process_mad definitely needs a buffer to return a response in. Are you suggesting that process_mad overwrite the input buffer when it generates a response? That's probably OK although I'm not sure if it's much of an improvement (process_mad will probably have to allocate a response buffer internally and copy the response when returning). - R. From halr at voltaire.com Wed Nov 10 07:53:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 10:53:21 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52actp3ede.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> Message-ID: <1100102000.2836.6.camel@hpc-1> On Wed, 2004-11-10 at 10:36, Roland Dreier wrote: > Yes, IPoIB works for me again. Thanks for validating. > Hal> Also, it seems to me that no response needs to be handed to > Hal> process_mad. Does this optimization make sense ? > > I'm not sure I understand the question. process_mad definitely needs > a buffer to return a response in. Are you suggesting that process_mad > overwrite the input buffer when it generates a response? That's > probably OK although I'm not sure if it's much of an improvement > (process_mad will probably have to allocate a response buffer > internally and copy the response when returning). I'm asking about also checking the method prior to calling process_mad. If the method is a response method (e.g. GetResp for one), we could bypass calling process_mad. 
Or is this not worth the extra checks in the MAD layer as it is low enough overhead and adds additional protocol knowledge into the MAD layer ? -- Hal From roland at topspin.com Wed Nov 10 08:05:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 08:05:14 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100102000.2836.6.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 10:53:21 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> <1100102000.2836.6.camel@hpc-1> Message-ID: <521xf13d1h.fsf@topspin.com> Hal> I'm asking about also checking the method prior to calling Hal> process_mad. If the method is a response method (e.g. GetResp Hal> for one), we could bypass calling process_mad. Or is this not Hal> worth the extra checks in the MAD layer as it is low enough Hal> overhead and adds additional protocol knowledge into the MAD Hal> layer ? Oh, I see now. I don't think that's worth doing. I think keeping the MAD code simpler is probably best right now. - R. From Nitin.Hande at Sun.COM Wed Nov 10 08:05:48 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Wed, 10 Nov 2004 08:05:48 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100096891.801.25.camel@hpc-1> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> Message-ID: <41923C5C.7080501@Sun.COM> Hal Rosenstock wrote: > On Wed, 2004-11-10 at 00:55, Roland Dreier wrote: > >>It seems that MAD handling is still not quite right. It seems in my >>set up that IPoIB is not seeing the response to its MCMember >>set... (it does look like the query is reaching the SM) > > > This is a separate issue from the ports not becoming active (DR handling > issue). I broke this part yesterday (not a good day at all :-( at either > r1184 and/or r1181 when I added what I thought was correct based on > Sean's emails (not dispatching additional error cases in > ib_mad_recv_done_handler (and then improperly thought I verified the > changes that things were still working)). > > I can see now that this is wrong and have a fix for what stops IPoIB > from working. The problem was that the response was received by the MAD > layer but not dispatched due to the change(s) noted above. > > So I am patching at least enough to get things operational for now. > Please confirm that it works for you. I will not touch things until I > hear that it does. IPoIB seems to be working for me. I am on x86_64 platform. Thanks Nitin > > Also, it seems to me that no response needs to be handed to process_mad. > Does this optimization make sense ? > > Sorry for the temporary inconvenience. I will try not to do this again. > It is no fun for anyone. 
> > -- Hal > > mad: In ib_mad_recv_done_handler, if process_mad returns SUCCESS but not > REPLY, received packet still needs to be dispatched > > Index: mad.c > =================================================================== > --- mad.c (revision 1187) > +++ mad.c (working copy) > @@ -1151,8 +1151,8 @@ > port_priv->device, > port_priv->port_num)) > response = NULL; > + goto out; > } > - goto out; > } > } From halr at voltaire.com Wed Nov 10 08:28:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 11:28:24 -0500 Subject: [openib-general] Question on handle_outgoing_smp References: Message-ID: <001d01c4c742$4e4f5f10$6814a8c0@Gripen> root wrote: > In the following code : > > if (smi_check_local_dr_smp(smp, mad_agent->device, > mad_agent->port_num)) { ... > ret = mad_agent->device->process_mad( > mad_agent->device, > 0, > mad_agent->port_num, > smp->dr_slid, /* ? */ > (struct ib_mad *)smp, > (struct ib_mad > *)&mad_priv->mad); > > How do we guarantee that process_mad() was supplied (not NULL) ? > That is, what if smi_check_local_smp didn't get called via > smi_check_local_dr_smp ? Sorry for the use of a bad mail client here, but I didn't receive this on my normal email client. A check that the process_mad routine is supplied needs to be added here. I missed it in this spot (but had it in the other place in mad.c where process_mad is called). I will issue a patch for this in a while. Thanks. -- Hal From mshefty at ichips.intel.com Wed Nov 10 08:55:34 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 08:55:34 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100096891.801.25.camel@hpc-1> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> Message-ID: <41924806.8060509@ichips.intel.com> Hal Rosenstock wrote: > This is a separate issue from the ports not becoming active (a DR handling > issue). I broke this part yesterday (not a good day at all :-() at either > r1184 and/or r1181, when I added what I thought was correct based on > Sean's emails (not dispatching additional error cases in > ib_mad_recv_done_handler), and then improperly thought I had verified that > things were still working. What exactly does it mean then when process_mad returns success? Do any of the return bits from process_mad indicate that the MAD was for the HCA driver?
- Sean From roland at topspin.com Wed Nov 10 08:59:54 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 08:59:54 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <41924806.8060509@ichips.intel.com> (Sean Hefty's message of "Wed, 10 Nov 2004 08:55:34 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> Message-ID: <52wtwt1vxx.fsf@topspin.com> Sean> What exactly does it mean then when process_mad returns Sean> success? Do any of the return bits from process_mad Sean> indicate that the MAD was for the HCA driver? SUCCESS means that process_mad didn't encounter any errors. If REPLY or CONSUMED is set then process_mad actually handled the packet. - R. From roland at topspin.com Wed Nov 10 09:02:16 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 09:02:16 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52wtwt1vxx.fsf@topspin.com> (Roland Dreier's message of "Wed, 10 Nov 2004 08:59:54 -0800") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> Message-ID: <52sm7h1vtz.fsf@topspin.com> By the way, if I am reading the code correctly, it looks like the MAD layer only checks for IB_MAD_RESULT_REPLY and not IB_MAD_RESULT_CONSUMED. If IB_MAD_RESULT_CONSUMED is set then the packet is something like a trap repress handled by the SMA or a locally generated trap that the driver forwarded to the SM, so the packet should not go through agent dispatch. - R. From halr at voltaire.com Wed Nov 10 09:20:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 12:20:05 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52wtwt1vxx.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> Message-ID: <1100107204.2836.36.camel@hpc-1> On Wed, 2004-11-10 at 11:59, Roland Dreier wrote: > Sean> What exactly does it mean then when process_mad returns > Sean> success? Do any of the return bits from process_mad > Sean> indicate that the MAD was for the HCA driver? > > SUCCESS means that process_mad didn't encounter any errors. If REPLY > or CONSUMED is set then process_mad actually handled the packet. I would assume that REPLY and CONSUMED are also mutually exclusive. 
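Assuming they really are mutually exclusive, the receive-side handling I have in mind would look roughly like this (only a sketch against the process_mad call shape used in mad.c today, with illustrative variable names, not actual code):

/* Sketch: dispatch decision after process_mad, honoring both bits. */
ret = port_priv->device->process_mad(port_priv->device, 0,
                                     port_priv->port_num, slid,
                                     (struct ib_mad *)recv,
                                     (struct ib_mad *)response);
if (!(ret & IB_MAD_RESULT_SUCCESS)) {
        /* process_mad itself failed: drop the packet */
} else if (ret & IB_MAD_RESULT_CONSUMED) {
        /* e.g. trap repress handled by the SMA: no reply, no agent dispatch */
} else if (ret & IB_MAD_RESULT_REPLY) {
        /* driver filled in a response: send it, no agent dispatch */
} else {
        /* not handled by the driver: dispatch to a matching MAD agent */
}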
-- Hal From halr at voltaire.com Wed Nov 10 09:26:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 12:26:00 -0500 Subject: [openib-general] [PATCH] mad: In handle_outgoing_smp, validate process_mad routine exists prior to calling it Message-ID: <1100107560.2836.41.camel@hpc-1> mad: In handle_outgoing_smp, validate process_mad routine exists prior to calling it (issue pointed out by KK) Index: mad.c =================================================================== --- mad.c (revision 1189) +++ mad.c (working copy) @@ -405,30 +405,32 @@ goto error1; } - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - ret = mad_agent->device->process_mad( - mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ - (struct ib_mad *)smp, - (struct ib_mad *)&mad_priv->mad); - if ((ret & IB_MAD_RESULT_SUCCESS) && - (ret & IB_MAD_RESULT_REPLY)) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&mad_priv->mad, - mad_agent->device->node_type, - mad_agent->port_num, - mad_agent->device->phys_port_cnt)) { - ret = -EINVAL; - kmem_cache_free(ib_mad_cache, mad_priv); - goto error1; + if (mad_agent->device->process_mad) { + ret = mad_agent->device->process_mad( + mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? */ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + if ((ret & IB_MAD_RESULT_SUCCESS) && + (ret & IB_MAD_RESULT_REPLY)) { + if (!smi_handle_dr_smp_recv( + (struct ib_smp *)&mad_priv->mad, + mad_agent->device->node_type, + mad_agent->port_num, + mad_agent->device->phys_port_cnt)) { + ret = -EINVAL; + kmem_cache_free(ib_mad_cache, + mad_priv); + goto error1; + } } } /* See if response is solicited and there is a recv handler */ + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); if (solicited_mad(&mad_priv->mad.mad) && mad_agent_priv->agent.recv_handler) { struct ib_wc wc; From halr at voltaire.com Wed Nov 10 09:36:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 12:36:17 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <41916B15.8050909@ichips.intel.com> References: <41916B15.8050909@ichips.intel.com> Message-ID: <1100108177.2836.48.camel@hpc-1> On Tue, 2004-11-09 at 20:12, Sean Hefty wrote: > The following patch adds support for handling QP0/1 send queue overrun, > along with a couple of related fixes: > > * The patch includes the one provided by Roland to configure the > fabric. > * The code no longer modifies the user's send_wr structures when sending > a MAD. > * Sent MAD work requests are copied in order to handle both queuing and > error recovery (when added). > * The receive side code was slightly restructured to use a single > function to repost receives. If a receive cannot be posted for some > reason (e.g. lack of memory), it will now try to refill the receive > queue when posting an additional receive. (This will also make it > possible for the code to be lazier about reposting receives, which would > allow for better batching of completions.) I will break this up into two chunks: 1. the minor agent change 2. the rest (mad changes), excluding the already applied patch (to bring the ports up to ACTIVE), which I believe is temporary. > > Also, I switched my mailer, so I apologize in advance if I hose up my patch. It seems to have doubled up the inline diffs (as well as including it as an attachment), but there is no need to regenerate because of this.
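To make the queuing part concrete, the overrun handling amounts to a pattern like the following (my sketch with made-up names, not Sean's actual patch; the point is that the copied work request can safely sit on a list until the QP has room):

/* Sketch: post if the QP send queue has room, otherwise park the copy. */
spin_lock_irqsave(&qp_info->send_list_lock, flags);
if (qp_info->active_sends < qp_info->send_queue_depth) {
        ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, &bad_send_wr);
        if (!ret)
                qp_info->active_sends++;
} else {
        list_add_tail(&mad_send_wr->list, &qp_info->overflow_list);
        ret = 0;
}
spin_unlock_irqrestore(&qp_info->send_list_lock, flags);
/* In the send completion handler, post the next work request from
 * overflow_list (if any) before reporting the completion upward. */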
-- Hal From halr at voltaire.com Wed Nov 10 10:02:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 13:02:44 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <521xf13d1h.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> <1100102000.2836.6.camel@hpc-1> <521xf13d1h.fsf@topspin.com> Message-ID: <1100109764.2836.50.camel@hpc-1> On Wed, 2004-11-10 at 11:05, Roland Dreier wrote: > I think keeping the MAD code simpler is probably best right now. Hope that is for technical reasons and not for the recent missteps. -- Hal From halr at voltaire.com Wed Nov 10 10:04:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 13:04:41 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <1100108177.2836.48.camel@hpc-1> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> Message-ID: <1100109881.2836.52.camel@hpc-1> On Wed, 2004-11-10 at 12:36, Hal Rosenstock wrote: > I will break this up into two chunks: > 1. the minor agent change Thanks. Applied. -- Hal From roland at topspin.com Wed Nov 10 10:02:04 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 10:02:04 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100109764.2836.50.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 13:02:44 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <52actp3ede.fsf@topspin.com> <1100102000.2836.6.camel@hpc-1> <521xf13d1h.fsf@topspin.com> <1100109764.2836.50.camel@hpc-1> Message-ID: <52bre51t2b.fsf@topspin.com> Roland> I think keeping the MAD code simpler is probably best right now. Hal> Hope that is for technical reasons and not for the recent missteps. Yes, it's just that the MAD code is quite complicated already with multiple tests for DR SMPs etc; mad.c alone is over 2000 lines now. I don't think you could even find a microbenchmark that could measure the improvement in testing the response bit in the MAD code rather than calling into process_mad for every packet, so I don't think we need to add more code to the MAD layer to do it. - R. From halr at voltaire.com Wed Nov 10 10:32:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 13:32:49 -0500 Subject: [openib-general] Solicited response with no matching send request Message-ID: <1100111569.2836.61.camel@hpc-1> Hi, I was just rerunning all of my test cases and have a question about the MAD layer receive processing: Currently if no matching send request is found, the received MAD is freed (around line 1035 of the current mad.c). In this case, timeout too short, etc., is this the correct behavior ? Or should the receive packet be given to a matching MAD agent with a receive handler (perhaps with a different status) ? 
The latter would allow for an additional send model for requests, which I don't think is supported now, at the cost of having the client throw away these receives based on a new status code (perhaps some sort of timeout). Just wondering... Thanks. -- Hal From mshefty at ichips.intel.com Wed Nov 10 10:43:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 10:43:56 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100111569.2836.61.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> Message-ID: <4192616C.7070905@ichips.intel.com> Hal Rosenstock wrote: > Currently if no matching send request is found, the received MAD is > freed (around line 1035 of the current mad.c). > > In this case, timeout too short, etc., is this the correct behavior ? > Or should the receive packet be given to a matching MAD agent with a > receive handler (perhaps with a different status) ? The latter would > allow for an additional send model for requests, which I don't think is > supported now, at the cost of having the client throw away these receives > based on a new status code (perhaps some sort of timeout). I think that this is the behavior that you'd want, but I can see your view, and I'm open to changing it. From a client's perspective, dropping an unmatched MAD keeps the client from having to handle receive MADs without having a send outstanding. That is, I would think that a client that could make use of this MAD would have to be fairly complex. I see a couple of cases where this would happen. The first is the one you mention, where the timeout was too short. If the client retries the request, then they would need to deal with an unmatched response coming in before they issued the retry, while the retry is active (where the retry is sent after the receive had been checked for a match), or after the retry completed (with the need to handle multiple unmatched responses). The second case where I can see this happening is if the client canceled the send, and I'm not sure that we'd want to give the client an unmatched response in this case.
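To spell out what "unmatched" means in code terms, the receive side today does roughly the following (a sketch, not the actual mad.c; find_send_by_tid is a made-up name for the TID lookup):

/* Sketch: solicited receive matching against outstanding sends. */
if (solicited_mad(&recv->mad.mad)) {
        spin_lock_irqsave(&mad_agent_priv->lock, flags);
        mad_send_wr = find_send_by_tid(mad_agent_priv,
                                       recv->mad.mad.mad_hdr.tid);
        spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
        if (!mad_send_wr) {
                /* no matching send request: the response is simply freed */
                kmem_cache_free(ib_mad_cache, recv);
                return;
        }
        /* otherwise the receive goes to the agent's recv_handler */
}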
- Sean From halr at voltaire.com Wed Nov 10 11:07:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 14:07:32 -0500 Subject: [openib-general] [PATCH] agent: Handle out of order send completions Message-ID: <1100113652.2836.72.camel@hpc-1> agent: Handle out of order send completions (Issue pointed out by Sean) Index: agent_priv.h =================================================================== --- agent_priv.h (revision 1183) +++ agent_priv.h (working copy) @@ -46,7 +46,6 @@ struct ib_mad_agent *lr_smp_agent; /* LR SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ struct ib_mr *mr; - u64 wr_id; }; #endif /* __IB_AGENT_PRIV_H__ */ Index: agent.c =================================================================== --- agent.c (revision 1192) +++ agent.c (working copy) @@ -117,9 +117,9 @@ /* PCI mapping */ gather_list.addr = pci_map_single(mad_agent->device->dma_device, &mad->mad, - sizeof mad->mad, + sizeof(mad->mad), PCI_DMA_TODEVICE); - gather_list.length = sizeof mad->mad; + gather_list.length = sizeof(mad->mad); gather_list.lkey = (*port_priv->mr).lkey; send_wr.next = NULL; @@ -172,7 +172,7 @@ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; - send_wr.wr_id = ++port_priv->wr_id; + send_wr.wr_id = (unsigned long)&agent_send_wr->send_list; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -182,7 +182,7 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof mad->mad, + sizeof(mad->mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); @@ -247,31 +247,18 @@ return; } - /* Completion corresponds to first entry on posted MAD send list */ spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (list_empty(&port_priv->send_posted_list)) { - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - printk(KERN_ERR SPFX "Send completion WR ID 0x%Lx but send " - "list is empty\n", - (unsigned long long) mad_send_wc->wr_id); - return; - } - - agent_send_wr = list_entry(&port_priv->send_posted_list, - struct ib_agent_send_wr, - send_list); - send_wr = agent_send_wr->send_list.next; - agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, + send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; + agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, send_list); - - /* Remove from posted send MAD list */ + /* Remove completed send from posted send MAD list */ list_del(&agent_send_wr->send_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ pci_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), - sizeof agent_send_wr->mad->mad, + sizeof(agent_send_wr->mad->mad), PCI_DMA_TODEVICE); ib_destroy_ah(agent_send_wr->ah); @@ -306,7 +293,6 @@ memset(port_priv, 0, sizeof *port_priv); port_priv->port_num = port_num; - port_priv->wr_id = 0; spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->send_posted_list); From mshefty at ichips.intel.com Wed Nov 10 11:07:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 11:07:00 -0800 Subject: [openib-general] [PATCH] agent: Handle out of order send completions In-Reply-To: <1100113652.2836.72.camel@hpc-1> References: <1100113652.2836.72.camel@hpc-1> Message-ID: <419266D4.6040005@ichips.intel.com> Hal Rosenstock wrote: > - send_wr.wr_id = ++port_priv->wr_id; > + send_wr.wr_id = (unsigned 
long)&agent_send_wr->send_list; {snip} > + send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; > + agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, > send_list); I think it may be clearer to set the wr_id to agent_send_wr, rather than a subfield. Thanks for doing this btw; I can take it off my to-do list. :) - Sean From halr at voltaire.com Wed Nov 10 12:43:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 15:43:03 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <1100108177.2836.48.camel@hpc-1> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> Message-ID: <1100119383.2836.81.camel@hpc-1> On Wed, 2004-11-10 at 12:36, Hal Rosenstock wrote: > I will break this up into two chunks: > 2. the rest (mad changes), excluding the already applied patch (to bring > the ports up to ACTIVE), which I believe is temporary. A few minor questions (before applying this): 1. Why was BUG_ON removed from dequeue_mad ? 2. A couple of questions related to send_wr->num_sge checking. a. Should this be pushed down to mthca and detected there rather than at the MAD layer ? b. If it is to stay at the MAD layer, shouldn't there be a check inside the while (send_wr) loop rather than above it ? -- Hal From halr at voltaire.com Wed Nov 10 12:53:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 15:53:26 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <4192616C.7070905@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> Message-ID: <1100120005.2836.86.camel@hpc-1> On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Currently if no matching send request is found, the received MAD is > > freed (around line 1035 of the current mad.c). > > > > In this case, timeout too short, etc., is this the correct behavior ? > > Or should the receive packet be given to a matching MAD agent with a > > receive handler (perhaps with a different status) ? The latter would > > allow for an additional send model for requests, which I don't think is > > supported now, at the cost of having the client throw away these receives > > based on a new status code (perhaps some sort of timeout). > > I think that this is the behavior that you'd want, but I can see your > view, and I'm open to changing it. From a client's perspective, > dropping an unmatched MAD keeps the client from having to handle receive > MADs without having a send outstanding. That is, I would think that a > client that could make use of this MAD would have to be fairly complex. I don't know whether the SM or other managers would use this model, so it's just a thought to keep in mind for the future. > I see a couple of cases where this would happen. The first is the one > you mention, where the timeout was too short. If the client retries the > request, then they would need to deal with an unmatched response coming > in before they issued the retry, while the retry is active (where the > retry is sent after the receive had been checked for a match), or after the > retry completed (with the need to handle multiple unmatched responses). > > The second case where I can see this happening is if the client canceled > the send, and I'm not sure that we'd want to give the client an > unmatched response in this case.
So we would also need to time out send MAD cancellations (rather than eliminating them immediately) so we wouldn't give a receive back in that case. -- Hal From halr at voltaire.com Wed Nov 10 12:57:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 15:57:26 -0500 Subject: [openib-general] MAD agent code comments In-Reply-To: <4190F7C6.8020509@ichips.intel.com> References: <41900EEB.7050109@ichips.intel.com> <1100009558.13933.3.camel@localhost.localdomain> <4190F7C6.8020509@ichips.intel.com> Message-ID: <1100120246.2836.90.camel@hpc-1> Hi Sean, On Tue, 2004-11-09 at 12:00, Sean Hefty wrote: > Hal Rosenstock wrote: > > Since the agent does not use solicited sends, are its sends completed in > > order (so this is only an issue for clients using solicited sends) ? > > I would think that solicited sends (i.e. responses) would be easier to > maintain order, since those wouldn't have a timeout. We are using solicited slightly differently. I am using it for sending a request which has a timeout and is expected to elicit a response. > But my preference > would be to not define the API this way. It makes queuing for QP > overrun and error handling difficult. > > For example, a client posts 2 sends, both of which get queued. If the > first send gets posted, but the second send fails when posting to the > QP, then we'd need to delay reporting the second send's completion. > This also makes it more difficult to go to multi-threaded completion > handling, if that were shown to be beneficial. I posted a patch for this which you have seen. -- Hal From halr at voltaire.com Wed Nov 10 13:19:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 16:19:29 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52r7n22tgt.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> Message-ID: <1100121569.2836.112.camel@hpc-1> I haven't cleared the other issues before getting back to this, but wanted to respond to some of the points below: On Tue, 2004-11-09 at 23:55, Roland Dreier wrote: > Roland> OK, I think I understand the problem, but I'm not sure > Roland> what the correct solution is. When a DR SMP arrives at a > Roland> CA from the SM, hop_cnt == hop_ptr == number of hops in > Roland> the directed route, > > Hal> What was the number ? > > For one port it was 4 and for another it was 6. It could really be > anything (it's just how many hops away the SM is). I think I understand how DR is supposed to work :-) I was just looking for the actual values in the failed case to try to understand what the code was doing, as I don't have a configuration to recreate this (at least yet). From what you indicated, it looks like it would be the following case, so no response would be sent: /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { if (node_type != IB_NODE_SWITCH) return 0; but I'm not sure whether those were the values on entry to the smi_dr_handle_smp_recv routine that was excised from the code. > Hal> I integrated this patch and checked it back in. I don't think > Hal> this is the solution for all cases (and something else is > Hal> broken). > > Could be.
I had a hard time checking the code in smi.c (which is > split between smi_handle_dr_smp_recv() and smi_handle_dr_smp_send() as > well as smi_check_forward_dr_smp(), but which has outgoing and > returning DR handling mixed together) against the IB spec (which > splits outgoing and returning DR handling). I had to squint hard the first time I went through this too (and probably will again). I will explain how this works in sufficient detail if this is of interest. > Hal> The second call to smi_handle_dr_smp_recv was to validate the > Hal> DR in the response packet before sending it. The response > Hal> would be a returning DR packet (D bit 1). If hop_cnt == > Hal> hop_ptr, > > I guess the problem with calling smi_handle_dr_smp_recv() twice on the > same packet is that the function may alter the packet. No, the second call to smi_handle_dr_smp_recv() was on the outgoing response and not the incoming request. The thought was that a packet coming from process_mad is much like an incoming received packet and hence the call to smi_handle_dr_smp_recv. The routine validates the packet but also can do some fixups depending on which case it falls into. Guess it's only dangerous to validate this and wrong to fix it up. The key to me is the following: The split of responsibility on the DR header formation is a little unclear to me. In the case of the SM, are the DR headers fully formed before handing it to the MAD layer or is some DR fixup needed ? -- Hal From roland at topspin.com Wed Nov 10 13:29:40 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 10 Nov 2004 13:29:40 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100121569.2836.112.camel@hpc-1> (Hal Rosenstock's message of "Wed, 10 Nov 2004 16:19:29 -0500") References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <1100121569.2836.112.camel@hpc-1> Message-ID: <52pt2lz92z.fsf@topspin.com> Roland> I guess the problem with calling smi_handle_dr_smp_recv() Roland> twice on the same packet is that the function may alter Roland> the packet. Hal> No, the second call to smi_handle_dr_smp_recv() was on the Hal> outgoing response and not the incoming request. The thought Hal> was that a packet coming from process_mad is much like an Hal> incoming received packet and hence the call to Hal> smi_handle_dr_smp_recv. The routine validates the packet but Hal> also can do some fixups depending on which case it falls Hal> into. Guess it's only dangerous to validate this and wrong to Hal> fix it up. Maybe I'm misreading the code, but my patch deleted the call to smi_handle_dr_smp_recv() before the call to agent_send. agent_send() eventually ends up in ib_post_send_mad(), which calls handle_outgoing_smp() for directed route MADs, which ends up calling smi_handle_dr_smp_recv() again. Since smi_handle_dr_smp_recv() can change the packet, calling it twice on the same packet seems to break things. However I don't think it's a good idea to think of responses generated by process_mad as an incoming received packet. I think they should be thought of as returning DR SMPs being passed to the SMI for sending (as in section 14.2.2 of the IB spec). Hal> The key to me is the following: The split of responsibility Hal> on the DR header formation is a little unclear to me. 
In the Hal> case of the SM, are the DR headers fully formed before Hal> handing it to the MAD layer or is some DR fixup needed ? My suggestion would be to follow the IB spec, and assume that the SM follows the SMP initialization in 14.2.2.1 and have the MAD layer just implement the SMI processing in 14.2.2.2. (And I believe things should work similarly for responses generated by the SMA -- the MAD layer should just do SMI processing). - R. From mshefty at ichips.intel.com Wed Nov 10 13:30:13 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 13:30:13 -0800 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <1100119383.2836.81.camel@hpc-1> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> <1100119383.2836.81.camel@hpc-1> Message-ID: <41928865.3020003@ichips.intel.com> Hal Rosenstock wrote: > 1. Why was BUG_ON removed from dequeue_mad ? That can be put back. I removed queue_mad, and was going to remove dequeue_mad, but decided to leave it. > 2. A couple of questions related to send_wr->num_sge checking. > a. Should this be pushed down to mthca and detected there rather than at > the MAD layer ? > b. If it is to stay at the MAD layer, shouldn't there be a check inside > the while (send_wr) loop rather than above it ? I put this check in the MAD layer, since it may be more restrictive than what mthca provides. Looking at that part of the code, we can push the check to mthca by making the following changes: Move sg_list[] in ib_mad_send_wr_private to the end of the structure. Change the sg_list array size from IB_MAD_SEND_REQ_MAX_SG to 1. Change the kmalloc in ib_post_send_mad() to use sizeof *mad_send_wr + sizeof *mad_send_wr->sg_list * (send_wr->num_sge - 1) You are correct that the check needs to be within the while loop if it remains in the MAD code. - Sean From paul.baxter at dsl.pipex.com Wed Nov 10 13:49:23 2004 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Wed, 10 Nov 2004 21:49:23 -0000 Subject: [openib-general] News: Roland, Hal, Sean et al might actually get paid! Message-ID: <008e01c4c76f$24c5da20$8000000a@blorp> Glad to see http://news.zdnet.com/2100-9593_22-5446887.html One snippet from the article '..the grant will fund 8-10 full-time programmers.' Does this equate to Sean, Roland, Hal working 80 hour weeks with some support from others merely working 40 hour weeks :) Just wanted to say well done for the work to date but you guys are allowed to take the weekend off occasionally. Lets hope this opportunity to sell the positive contribution to Infiniband and Linux gets heard. I know Roland wrote a partial rebuttal to Greg KH's LWN article, but I can't help feeling part of getting adoption of IB in Linux is the PR battle. From halr at voltaire.com Wed Nov 10 14:20:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:20:39 -0500 Subject: [openib-general] Re: [PATCH] handle QP0/1 send queue overrun In-Reply-To: <41928865.3020003@ichips.intel.com> References: <41916B15.8050909@ichips.intel.com> <1100108177.2836.48.camel@hpc-1> <1100119383.2836.81.camel@hpc-1> <41928865.3020003@ichips.intel.com> Message-ID: <1100125238.2836.125.camel@hpc-1> On Wed, 2004-11-10 at 16:30, Sean Hefty wrote: > Hal Rosenstock wrote: > > 1. Why was BUG_ON removed from dequeue_mad ? > > That can be put back. I removed queue_mad, and was going to remove > dequeue_mad, but decided to leave it. I added this back in. > > 2. A couple of questions related to send_wr->num_sge checking. > > a. 
Should this be pushed down to mthca and detected there rather than at > > the MAD layer ? > > b. If it is to stay at the MAD layer, shouldn't there be a check inside > > the while (send_wr) loop rather than above it ? > > I put this check in the MAD layer, since it may be more restrictive than > what mthca provides. Looking at that part of the code, we can push the > check to mthca by making the following changes: > > Move sg_list[] in ib_mad_send_wr_private to the end of the structure. > Change the sg_list array size from IB_MAD_SEND_REQ_MAX_SG to 1. > Change the kmalloc in ib_post_send_mad() to use sizeof *mad_send_wr + > sizeof *mad_send_wr->sg_list * (send_wr->num_sge - 1) > > You are correct that the check needs to be within the while loop if it > remains in the MAD code. I made the above changes by hand (moving the check down for at least the time being). Thanks! Applied. (Nice work). -- Hal From mshefty at ichips.intel.com Wed Nov 10 14:22:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 14:22:55 -0800 Subject: [openib-general] [PATCH] [TRIVIAL] remove unneeded locking in ib_mad_return_posted_recv_mads Message-ID: <419294BF.50207@ichips.intel.com> Removed locking, since this is in cleanup code. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1197) +++ core/mad.c (working copy) @@ -1602,7 +1602,6 @@ struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; - spin_lock_irqsave(&qp_info->recv_queue.lock, flags); while (!list_empty(&qp_info->recv_queue.list)) { mad_list = list_entry(qp_info->recv_queue.list.next, @@ -1626,7 +1625,6 @@ } qp_info->recv_queue.count = 0; - spin_unlock_irqrestore(&qp_info->recv_queue.lock, flags); } /* From halr at voltaire.com Wed Nov 10 14:30:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:30:19 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52pt2lz92z.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <1100121569.2836.112.camel@hpc-1> <52pt2lz92z.fsf@topspin.com> Message-ID: <1100125819.2836.136.camel@hpc-1> On Wed, 2004-11-10 at 16:29, Roland Dreier wrote: > Roland> I guess the problem with calling smi_handle_dr_smp_recv() > Roland> twice on the same packet is that the function may alter > Roland> the packet. > > Hal> No, the second call to smi_handle_dr_smp_recv() was on the > Hal> outgoing response and not the incoming request. The thought > Hal> was that a packet coming from process_mad is much like an > Hal> incoming received packet and hence the call to > Hal> smi_handle_dr_smp_recv. The routine validates the packet but > Hal> also can do some fixups depending on which case it falls > Hal> into. Guess it's only dangerous to validate this and wrong to > Hal> fix it up. > > Maybe I'm misreading the code, but my patch deleted the call to > smi_handle_dr_smp_recv() before the call to agent_send. You're not. I was... > agent_send() eventually ends up in ib_post_send_mad(), which calls > handle_outgoing_smp() for directed route MADs, which ends up calling > smi_handle_dr_smp_recv() again. Since smi_handle_dr_smp_recv() can > change the packet, calling it twice on the same packet seems to break things. I'm with you now. 
> However I don't think it's a good idea to think of responses generated > by process_mad as an incoming received packet. I think they should be > thought of as returning DR SMPs being passed to the SMI for sending > (as in section 14.2.2 of the IB spec). Yup, there is a difference between a returning SMP being sent and an incoming SMP being received in terms of SMI. I was being imprecise again. > Hal> The key to me is the following: The split of responsibility > Hal> on the DR header formation is a little unclear to me. In the > Hal> case of the SM, are the DR headers fully formed before > Hal> handing it to the MAD layer or is some DR fixup needed ? > > My suggestion would be to follow the IB spec, and assume that the SM > follows the SMP initialization in 14.2.2.1 and have the MAD layer just > implement the SMI processing in 14.2.2.2. (And I believe things > should work similarly for responses generated by the SMA -- the MAD > layer should just do SMI processing). That was the intention. I will figure out what is broke but not just yet... I may want something tried by either you or Sean prior to my checking it in to be sure. I'll let you know. Thanks. -- Hal From mshefty at ichips.intel.com Wed Nov 10 14:33:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Nov 2004 14:33:44 -0800 Subject: [openib-general] [PATCH] adjust error checking in ib_post_send_mad Message-ID: <41929748.1030204@ichips.intel.com> Removes unneeded check and relocates other to while loop. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1197) +++ core/mad.c (working copy) @@ -518,14 +518,10 @@ if (!bad_send_wr) goto error1; - if (!mad_agent || !send_wr ) + if (!mad_agent || !send_wr) goto error2; - if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) - goto error2; - - if (!mad_agent->send_handler || - (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler)) + if (!mad_agent->send_handler) goto error2; mad_agent_priv = container_of(mad_agent, @@ -543,6 +539,9 @@ if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) goto error2; + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + if (!send_wr->wr.ud.mad_hdr) { printk(KERN_ERR PFX "MAD header must be supplied " "in WR %p\n", send_wr); From halr at voltaire.com Wed Nov 10 14:43:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:43:08 -0500 Subject: [openib-general] Re: [PATCH] [TRIVIAL] remove unneeded locking in ib_mad_return_posted_recv_mads In-Reply-To: <419294BF.50207@ichips.intel.com> References: <419294BF.50207@ichips.intel.com> Message-ID: <1100126588.2836.138.camel@hpc-1> On Wed, 2004-11-10 at 17:22, Sean Hefty wrote: > Removed locking, since this is in cleanup code. Thanks. Applied. -- Hal From halr at voltaire.com Wed Nov 10 14:54:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 17:54:02 -0500 Subject: [openib-general] Re: [PATCH] adjust error checking in ib_post_send_mad In-Reply-To: <41929748.1030204@ichips.intel.com> References: <41929748.1030204@ichips.intel.com> Message-ID: <1100127242.2836.140.camel@hpc-1> On Wed, 2004-11-10 at 17:33, Sean Hefty wrote: > Removes unneeded check and relocates other to while loop. Thanks. Applied. 
-- Hal From halr at voltaire.com Wed Nov 10 16:19:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 19:19:30 -0500 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <52sm7h1vtz.fsf@topspin.com> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> <52sm7h1vtz.fsf@topspin.com> Message-ID: <1100132370.3283.30.camel@localhost.localdomain> On Wed, 2004-11-10 at 12:02, Roland Dreier wrote: > By the way, if I am reading the code correctly, it looks like the MAD > layer only checks for IB_MAD_RESULT_REPLY and not > IB_MAD_RESULT_CONSUMED. You are reading the code correctly. > If IB_MAD_RESULT_CONSUMED is set then the > packet is something like a trap repress handled by the SMA or a > locally generated trap that the driver forwarded to the SM, so the > packet should not go through agent dispatch. This is a patch which should occur shortly. -- Hal From halr at voltaire.com Wed Nov 10 17:34:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 10 Nov 2004 20:34:18 -0500 Subject: [openib-general] [PATCH] mad: After calling process_mad, handle MAD being consumed Message-ID: <1100136858.2739.3.camel@hpc-1> mad: After calling process_mad, handle MAD being consumed Index: mad.c =================================================================== --- mad.c (revision 1199) +++ mad.c (working copy) @@ -400,16 +400,22 @@ smp->dr_slid, /* ? */ (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - if ((ret & IB_MAD_RESULT_SUCCESS) && - (ret & IB_MAD_RESULT_REPLY)) { - if (!smi_handle_dr_smp_recv( + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) { + ret = 1; + goto error1; + } + if (ret & IB_MAD_RESULT_REPLY) { + if (!smi_handle_dr_smp_recv( (struct ib_smp *)&mad_priv->mad, mad_agent->device->node_type, mad_agent->port_num, mad_agent->device->phys_port_cnt)) { - ret = -EINVAL; - kmem_cache_free(ib_mad_cache, mad_priv); - goto error1; + ret = -EINVAL; + kmem_cache_free(ib_mad_cache, + mad_priv); + goto error1; + } } } } @@ -1147,6 +1153,8 @@ recv->header.recv_buf.mad, &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; if (ret & IB_MAD_RESULT_REPLY) { /* Send response */ if (!agent_send(response, &recv->grh, wc, From halr at voltaire.com Thu Nov 11 04:45:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 07:45:54 -0500 Subject: [openib-general] [PATCH] agent: Handle out of order send completions In-Reply-To: <419266D4.6040005@ichips.intel.com> References: <1100113652.2836.72.camel@hpc-1> <419266D4.6040005@ichips.intel.com> Message-ID: <1100177153.3283.68.camel@localhost.localdomain> On Wed, 2004-11-10 at 14:07, Sean Hefty wrote: > Hal Rosenstock wrote: > > > - send_wr.wr_id = ++port_priv->wr_id; > > + send_wr.wr_id = (unsigned long)&agent_send_wr->send_list; > {snip} > > + send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; > > + agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, > > send_list); > > I think it may be clearer to set the wr_id to agent_send_wr, rather than > a subfield. Yes, that would be better (clearer and less code). Patch shortly for this. 
-- Hal From halr at voltaire.com Thu Nov 11 05:35:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 08:35:31 -0500 Subject: [openib-general] [PATCH] agent: Better wr_id in send WR makes for slightly simpler completion handling Message-ID: <1100180130.9470.38.camel@hpc-1> agent: Better wr_id in send WR makes for slightly simpler completion handling (comment from Sean) Index: agent.c =================================================================== --- agent.c (revision 1200) +++ agent.c (working copy) @@ -172,7 +172,7 @@ send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ } send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; - send_wr.wr_id = (unsigned long)&agent_send_wr->send_list; + send_wr.wr_id = (unsigned long)agent_send_wr; pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); @@ -236,7 +236,6 @@ { struct ib_agent_port_private *port_priv; struct ib_agent_send_wr *agent_send_wr; - struct list_head *send_wr; unsigned long flags; /* Find matching MAD agent */ @@ -247,10 +246,8 @@ return; } + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; spin_lock_irqsave(&port_priv->send_list_lock, flags); - send_wr = (struct list_head *)(unsigned long)mad_send_wc->wr_id; - agent_send_wr = container_of(send_wr, struct ib_agent_send_wr, - send_list); /* Remove completed send from posted send MAD list */ list_del(&agent_send_wr->send_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 06:01:00 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 06:01:00 -0800 Subject: [openib-general] New OpenIB webpages Message-ID: <1100181660.14334.548.camel@trinity> As some of you may have noticed, we migrated over to the new OpenIB web pages yesterday. The FAQ and a few other items are still a work in progress. Let me know if there are any errors or if folks have other feedback/suggestions. Thanks, - Matt From tduffy at sun.com Thu Nov 11 07:06:53 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 07:06:53 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100181660.14334.548.camel@trinity> References: <1100181660.14334.548.camel@trinity> Message-ID: <1100185613.22128.5.camel@duffman> On Thu, 2004-11-11 at 06:01 -0800, Matt Leininger wrote: > > As some of you may have noticed, we migrated over to the new OpenIB > web pages yesterday. The FAQ and a few other items are still a work in > progress. Let me know if there are any errors or if folks have other > feedback/suggestions. Well done. The new page looks great. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 07:41:44 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 07:41:44 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100181660.14334.548.camel@trinity> (Matt Leininger's message of "Thu, 11 Nov 2004 06:01:00 -0800") References: <1100181660.14334.548.camel@trinity> Message-ID: <52ekj0z93b.fsf@topspin.com> Matt> As some of you may have noticed, we migrated over to the Matt> new OpenIB web pages yesterday. The FAQ and a few other Matt> items are still a work in progress. Let me know if there Matt> are any errors or if folks have other feedback/suggestions. Looks great. 
One suggestions: under news, it's probably worth linking to or mentioning the PathForward funding announcement. - R. From tduffy at sun.com Thu Nov 11 08:14:21 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 08:14:21 -0800 Subject: [openib-general] openib.org/bugzilla Message-ID: <1100189661.25996.2.camel@duffman> I just signed up for an account, but the email confirmation had the wrong address. It said to go to: http://cvs-mirror.mozilla.org/webtools/bugzilla/userprefs.cgi Also, it seems there is no gen2 version in the query field. Thanks, -tduffy -- Tom Duffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 08:31:57 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 08:31:57 -0800 Subject: [openib-general] [PATCH] ipoib: Free AHs Message-ID: <52actoz6rm.fsf@topspin.com> This patch corrects the fact that IPoIB leaks all of its address handles by creating a list of dead AHs and freeing an AH once all the sends using it complete. Index: ulp/ipoib/ipoib_verbs.c =================================================================== --- ulp/ipoib/ipoib_verbs.c (revision 1201) +++ ulp/ipoib/ipoib_verbs.c (working copy) @@ -171,16 +171,6 @@ return -EINVAL; } -void ipoib_qp_destroy(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - if (ib_destroy_qp(priv->qp)) - ipoib_warn(priv, "ib_qp_destroy failed\n"); - - priv->qp = NULL; -} - int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) { struct ipoib_dev_priv *priv = netdev_priv(dev); Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 1201) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -177,7 +177,7 @@ struct ipoib_path *path = path_ptr; struct ipoib_dev_priv *priv = netdev_priv(path->dev); struct sk_buff *skb; - struct ib_ah *ah; + struct ipoib_ah *ah; ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); @@ -195,10 +195,10 @@ .port_num = priv->port }; - ah = ib_create_ah(priv->pd, &av); + ah = ipoib_create_ah(path->dev, priv->pd, &av); } - if (IS_ERR(ah)) + if (!ah) goto err; path->ah = ah; @@ -299,7 +299,7 @@ { struct sk_buff *skb = skb_ptr; struct ipoib_dev_priv *priv = netdev_priv(skb->dev); - struct ib_ah *ah; + struct ipoib_ah *ah; ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); @@ -307,6 +307,10 @@ if (status) goto err; + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + goto err; + { struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), @@ -317,13 +321,15 @@ .port_num = priv->port }; - ah = ib_create_ah(priv->pd, &av); + ah->ah = ib_create_ah(priv->pd, &av); } - if (IS_ERR(ah)) + if (IS_ERR(ah->ah)) { + kfree(ah); goto err; + } - *(struct ib_ah **) skb->cb = ah; + *(struct ipoib_ah **) skb->cb = ah; if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed " @@ -337,10 +343,15 @@ static void unicast_arp_finish(struct sk_buff *skb) { - struct ib_ah *ah = *(struct ib_ah **) skb->cb; + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + struct ipoib_ah *ah = *(struct ipoib_ah **) skb->cb; + unsigned long flags; - if (ah) - ib_destroy_ah(ah); + if (ah) { + spin_lock_irqsave(&priv->lock, flags); + 
list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } } /* @@ -443,7 +454,7 @@ * now we can just send the packet. */ if (skb->destructor == unicast_arp_finish) { - ipoib_send(dev, skb, *(struct ib_ah **) skb->cb, + ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, be32_to_cpup((u32 *) phdr->hwaddr)); return 0; } @@ -454,14 +465,7 @@ skb->dst ? "neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), - phdr->hwaddr[ 4], phdr->hwaddr[ 5], - phdr->hwaddr[ 6], phdr->hwaddr[ 7], - phdr->hwaddr[ 8], phdr->hwaddr[ 9], - phdr->hwaddr[10], phdr->hwaddr[11], - phdr->hwaddr[12], phdr->hwaddr[13], - phdr->hwaddr[14], phdr->hwaddr[15], - phdr->hwaddr[16], phdr->hwaddr[17], - phdr->hwaddr[18], phdr->hwaddr[19]); + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); /* put the pseudoheader back on */ skb_push(skb, sizeof *phdr); @@ -529,10 +533,17 @@ static void ipoib_neigh_destructor(struct neighbour *neigh) { - ipoib_dbg(netdev_priv(neigh->dev), - "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + struct ipoib_dev_priv *priv = netdev_priv(neigh->dev); + struct ipoib_path *path = IPOIB_PATH(neigh); + + ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", be32_to_cpup((__be32 *) neigh->ha), IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); + + if (path && path->ah) { + ipoib_put_ah(path->ah); + kfree(path); + } } static int ipoib_neigh_setup(struct neighbour *neigh) @@ -683,12 +694,14 @@ sema_init(&priv->mcast_mutex, 1); INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); } struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 1201) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -36,7 +36,7 @@ /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; - struct ib_ah *address_handle; + struct ipoib_ah *ah; struct rb_node rb_node; struct list_head list; @@ -69,11 +69,8 @@ ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); - if (mcast->address_handle != NULL) { - int ret = ib_destroy_ah(mcast->address_handle); - if (ret < 0) - ipoib_warn(priv, "ib_destroy_ah failed (ret = %d)\n", ret); - } + if (mcast->ah) + ipoib_put_ah(mcast->ah); while (!skb_queue_empty(&mcast->pkt_queue)) { struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); @@ -108,7 +105,7 @@ INIT_LIST_HEAD(&mcast->list); skb_queue_head_init(&mcast->pkt_queue); - mcast->address_handle = NULL; + mcast->ah = NULL; mcast->query = NULL; return mcast; @@ -224,14 +221,14 @@ av.grh.dgid = mcast->mcmember.mgid; - mcast->address_handle = ib_create_ah(priv->pd, &av); - if (IS_ERR(mcast->address_handle)) { + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { ipoib_warn(priv, "ib_address_create failed\n"); } else { ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT " AV %p, LID 0x%04x, SL %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), - mcast->address_handle, + mcast->ah->ah, be16_to_cpu(mcast->mcmember.mlid), mcast->mcmember.sl); 
} @@ -661,7 +658,7 @@ list_add_tail(&mcast->list, &priv->multicast_list); } - if (!mcast->address_handle) { + if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); else @@ -682,14 +679,15 @@ out: spin_unlock_irqrestore(&priv->lock, flags); - if (mcast && mcast->address_handle) { + if (mcast && mcast->ah) { if (skb->dst && skb->dst->neighbour && !IPOIB_PATH(skb->dst->neighbour)) { struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); if (path) { - path->ah = mcast->address_handle; + kref_get(&mcast->ah->ref); + path->ah = mcast->ah; path->qpn = IB_MULTICAST_QPN; path->dev = dev; path->neighbour = skb->dst->neighbour; @@ -697,7 +695,7 @@ } } - ipoib_send(dev, skb, mcast->address_handle, IB_MULTICAST_QPN); + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } } @@ -951,9 +949,9 @@ mcast = rb_entry(iter->rb_node, struct ipoib_mcast, rb_node); - *mgid = mcast->mcmember.mgid; - *created = mcast->created; - *queuelen = skb_queue_len(&mcast->pkt_queue); - *complete = mcast->address_handle != NULL; + *mgid = mcast->mcmember.mgid; + *created = mcast->created; + *queuelen = skb_queue_len(&mcast->pkt_queue); + *complete = !!mcast->ah; *send_only = (mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)) ? 1 : 0; } Index: ulp/ipoib/ipoib.h =================================================================== --- ulp/ipoib/ipoib.h (revision 1201) +++ ulp/ipoib/ipoib.h (working copy) @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -65,6 +66,7 @@ IPOIB_PKEY_STOP = 4, IPOIB_FLAG_SUBINTERFACE = 5, IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -109,6 +111,7 @@ struct work_struct mcast_task; struct work_struct flush_task; struct work_struct restart_task; + struct work_struct ah_reap_task; struct ib_device *ca; u8 port; @@ -134,18 +137,28 @@ struct ib_wc ibwc[IPOIB_NUM_WC]; + struct list_head dead_ahs; + struct proc_dir_entry *mcast_proc_entry; struct ib_event_handler event_handler; struct net_device_stats stats; + struct list_head child_intfs; struct list_head list; - struct list_head child_intfs; }; +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + struct ipoib_path { - struct ib_ah *ah; + struct ipoib_ah *ah; u32 qpn; struct sk_buff_head queue; @@ -166,8 +179,17 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ib_ah *address, u32 qpn); + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); @@ -213,7 +235,6 @@ union ib_gid *mgid); int ipoib_qp_create(struct net_device *dev); -void ipoib_qp_destroy(struct net_device *dev); int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); void ipoib_transport_dev_cleanup(struct net_device *dev); Index: ulp/ipoib/ipoib_ib.c =================================================================== --- ulp/ipoib/ipoib_ib.c (revision 1201) +++ ulp/ipoib/ipoib_ib.c (working copy) @@ -29,10 +29,44 @@ static DECLARE_MUTEX(pkey_sem); -static int _ipoib_ib_receive(struct ipoib_dev_priv *priv, - u64 work_request_id, - dma_addr_t addr) +struct ipoib_ah 
*ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) { + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_ib_receive(struct ipoib_dev_priv *priv, + u64 work_request_id, + dma_addr_t addr) +{ struct ib_sge list = { .addr = addr, .length = IPOIB_BUF_SIZE, @@ -50,8 +84,8 @@ } /* =============================================================== */ -/*.._ipoib_ib_post_receive -- post a receive buffer */ -static int _ipoib_ib_post_receive(struct net_device *dev, int id) +/*..ipoib_ib_post_receive -- post a receive buffer */ +static int ipoib_ib_post_receive(struct net_device *dev, int id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; @@ -72,24 +106,24 @@ PCI_DMA_FROMDEVICE); pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); - ret = _ipoib_ib_receive(priv, id, addr); + ret = ipoib_ib_receive(priv, id, addr); if (ret) - ipoib_warn(priv, "_ipoib_ib_receive failed for buf %d (%d)\n", + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", id, ret); return ret; } /* =============================================================== */ -/*.._ipoib_ib_post_receives -- post all receive buffers */ -static int _ipoib_ib_post_receives(struct net_device *dev) +/*..ipoib_ib_post_receives -- post all receive buffers */ +static int ipoib_ib_post_receives(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int i; for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { - if (_ipoib_ib_post_receive(dev, i)) { - ipoib_warn(priv, "_ipoib_ib_post_receive failed for buf %d\n", i); + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); return -EIO; } } @@ -108,7 +142,7 @@ if (entry->status != IB_WC_SUCCESS) { ipoib_warn(priv, "got failed completion event " - "(status=%d, wrid=%d, op=%d)", + "(status=%d, wrid=%d, op=%d)\n", entry->status, wr_id, entry->opcode); if (entry->opcode == IB_WC_SEND) { @@ -163,8 +197,8 @@ } /* repost receive */ - if (_ipoib_ib_post_receive(dev, wr_id)) - ipoib_warn(priv, "_ipoib_ib_post_receive failed " + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " "for buf %d\n", wr_id); } else ipoib_warn(priv, "completion event with wrid %d\n", @@ -262,7 +296,7 @@ /* =============================================================== */ /*..ipoib_send -- schedule an IB send work request */ void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ib_ah *address, u32 qpn) + struct ipoib_ah *address, u32 qpn) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_buf *tx_req; @@ -302,7 +336,7 @@ pci_unmap_addr_set(tx_req, mapping, addr); if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), - address, qpn, addr, skb->len)) { + address->ah, qpn, addr, skb->len)) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; tx_req->skb = NULL; @@ -312,6 +346,7 @@ dev->trans_start = jiffies; + address->last_send = priv->tx_head; ++priv->tx_head; 
spin_lock_irqsave(&priv->lock, flags); @@ -323,6 +358,38 @@ } } +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -334,12 +401,15 @@ return -1; } - ret = _ipoib_ib_post_receives(dev); + ret = ipoib_ib_post_receives(dev); if (ret) { - ipoib_warn(priv, "_ipoib_ib_post_receives returned %d\n", ret); + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); return -1; } + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + return 0; } @@ -395,8 +465,10 @@ int i; /* Kill the existing QP and allocate a new one */ - if (priv->qp != NULL) - ipoib_qp_destroy(dev); + if (priv->qp != NULL) { + ib_destroy_qp(priv->qp); + priv->qp = NULL; + } for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { if (priv->rx_ring[i].skb) { @@ -463,14 +535,24 @@ /*..ipoib_ib_dev_cleanup -- clean up IB resources for iface */ void ipoib_ib_dev_cleanup(struct net_device *dev) { - ipoib_dbg(netdev_priv(dev), "cleaning up ib_dev\n"); + struct ipoib_dev_priv *priv = netdev_priv(dev); + ipoib_dbg(priv, "cleaning up ib_dev\n"); + ipoib_mcast_stop_thread(dev); /* Delete the broadcast address and the local address */ ipoib_mcast_dev_down(dev); ipoib_transport_dev_cleanup(dev); + + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } } /* From tduffy at sun.com Thu Nov 11 09:07:44 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:07:44 -0800 Subject: [openib-general] [Fwd: [Bug 1] New: kernel prints out error message for each ib interface] Message-ID: <1100192864.25996.5.camel@duffman> -------------- next part -------------- An embedded message was scrubbed... From: bugzilla-daemon at openib.org Subject: [Bug 1] New: kernel prints out error message for each ib interface Date: Thu, 11 Nov 2004 09:08:19 -0800 (PST) Size: 2770 URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Thu Nov 11 09:07:55 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:07:55 -0800 Subject: [openib-general] [Fwd: [Bug 2] New: ipoib does not work with ipv6] Message-ID: <1100192875.25996.7.camel@duffman> -------------- next part -------------- An embedded message was scrubbed... From: bugzilla-daemon at openib.org Subject: [Bug 2] New: ipoib does not work with ipv6 Date: Thu, 11 Nov 2004 09:18:36 -0800 (PST) Size: 3781 URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 09:09:14 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:09:14 -0800 Subject: [openib-general] [PATCH] Remove use of SPIN_LOCK_UNLOCKED Message-ID: <521xf0z51h.fsf@topspin.com> In the upstream kernel, the use of SPIN_LOCK_UNLOCKED is being phased out (look for changesets like "Lock initializer unifying"). This patch converts the MAD layer to use spin_lock_init() instead, please apply. - R. Index: core/agent.c =================================================================== --- core/agent.c (revision 1202) +++ core/agent.c (working copy) @@ -30,7 +30,7 @@ #include -static spinlock_t ib_agent_port_list_lock = SPIN_LOCK_UNLOCKED; +spinlock_t ib_agent_port_list_lock; static LIST_HEAD(ib_agent_port_list); extern kmem_cache_t *ib_mad_cache; @@ -382,4 +382,3 @@ return 0; } - Index: core/mad.c =================================================================== --- core/mad.c (revision 1202) +++ core/mad.c (working copy) @@ -74,7 +74,7 @@ static u32 ib_mad_client_id = 0; /* Port list lock */ -static spinlock_t ib_mad_port_list_lock = SPIN_LOCK_UNLOCKED; +static spinlock_t ib_mad_port_list_lock; /* Forward declarations */ @@ -2132,6 +2132,9 @@ { int ret; + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + ib_mad_cache = kmem_cache_create("ib_mad", sizeof(struct ib_mad_private), 0, @@ -2171,4 +2174,3 @@ module_init(ib_mad_init_module); module_exit(ib_mad_cleanup_module); - Index: core/agent.h =================================================================== --- core/agent.h (revision 1202) +++ core/agent.h (working copy) @@ -26,6 +26,8 @@ #ifndef __AGENT_H_ #define __AGENT_H_ +extern spinlock_t ib_agent_port_list_lock; + extern int ib_agent_port_open(struct ib_device *device, int port_num); From halr at voltaire.com Thu Nov 11 09:23:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 12:23:50 -0500 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <52actoz6rm.fsf@topspin.com> References: <52actoz6rm.fsf@topspin.com> Message-ID: <1100193829.28921.2.camel@hpc-1> On Thu, 2004-11-11 at 11:31, Roland Dreier wrote: > This patch corrects the fact that IPoIB leaks all of its address > handles by creating a list of dead AHs and freeing an AH once all the > sends using it complete. A couple of compile warnings: drivers/infiniband/ulp/ipoib/ipoib_main.c: In function `ipoib_neigh_destructor': drivers/infiniband/ulp/ipoib/ipoib_main.c:536: warning: unused variable `priv' and drivers/infiniband/ulp/ipoib/ipoib_multicast.c: In function `ipoib_mcast_free': drivers/infiniband/ulp/ipoib/ipoib_multicast.c:67: warning: unused variable `priv' Here's a trivial patch for these. 
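The warnings come from builds with IPoIB debugging disabled: ipoib_dbg() then expands to nothing, so a local variable held only for it is never referenced. A minimal sketch of the assumed macro shape (illustrative only -- the real definition lives in ipoib.h and may differ):

    #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
    #define ipoib_dbg(priv, format, arg...) \
            printk(KERN_DEBUG "%s: " format, (priv)->dev->name, ## arg)
    #else
    #define ipoib_dbg(priv, format, arg...) \
            do { } while (0)        /* priv is never evaluated here */
    #endif

Passing netdev_priv(...) directly into the macro call, as the patch below does, avoids the warning in both configurations.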
-- Hal Index: ipoib_main.c =================================================================== --- ipoib_main.c (revision 1205) +++ ipoib_main.c (working copy) @@ -533,10 +533,10 @@ static void ipoib_neigh_destructor(struct neighbour *neigh) { - struct ipoib_dev_priv *priv = netdev_priv(neigh->dev); struct ipoib_path *path = IPOIB_PATH(neigh); - ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + ipoib_dbg(netdev_priv(neigh->dev), + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", be32_to_cpup((__be32 *) neigh->ha), IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 1205) +++ ipoib_multicast.c (working copy) @@ -64,9 +64,9 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) { struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); - ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); if (mcast->ah) From roland at topspin.com Thu Nov 11 09:21:54 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:21:54 -0800 Subject: [openib-general] IPoIB w/ IBSRM? Message-ID: <52wtwsxpvx.fsf@topspin.com> Tom/Nitin, can you guys tell me if the latest IPoIB code works with IBSRM without any workarounds? I think the multicast group joining and creating should be spec compliant now but I'd like to make sure the old problems are really gone. Thanks, Roland From roland at topspin.com Thu Nov 11 09:24:11 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:24:11 -0800 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <1100193829.28921.2.camel@hpc-1> (Hal Rosenstock's message of "Thu, 11 Nov 2004 12:23:50 -0500") References: <52actoz6rm.fsf@topspin.com> <1100193829.28921.2.camel@hpc-1> Message-ID: <52sm7gxps4.fsf@topspin.com> Thanks, applied. - R. From halr at voltaire.com Thu Nov 11 09:37:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 12:37:34 -0500 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <52actoz6rm.fsf@topspin.com> References: <52actoz6rm.fsf@topspin.com> Message-ID: <1100194653.28921.16.camel@hpc-1> On Thu, 2004-11-11 at 11:31, Roland Dreier wrote: > This patch corrects the fact that IPoIB leaks all of its address > handles by creating a list of dead AHs and freeing an AH once all the > sends using it complete. Unfortunately I still see: ib0: ib_dealloc_pd failed when I removed ib_ipoib and then ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, ecb9b000 busy when I removed ib_mthca. Should the latter go away with this latest change ? -- Hal From roland at topspin.com Thu Nov 11 09:36:31 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:36:31 -0800 Subject: [openib-general] [PATCH] ipoib: Free AHs In-Reply-To: <1100194653.28921.16.camel@hpc-1> (Hal Rosenstock's message of "Thu, 11 Nov 2004 12:37:34 -0500") References: <52actoz6rm.fsf@topspin.com> <1100194653.28921.16.camel@hpc-1> Message-ID: <52oei4xp7k.fsf@topspin.com> Hal> Unfortunately I still see: Hal> ib0: ib_dealloc_pd failed Hal> when I removed ib_ipoib I understand why that happens: I try to free the PD before waiting for all the AHs to be reaped. This should be fixed soon. - R. 
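Pulling the pieces of this thread together, a sketch of the AH lifecycle the patch introduces -- names follow the patch, but this is assembled here for reading, not copied from it:

    static void example_unicast_send(struct net_device *dev,
                                     struct sk_buff *skb,
                                     struct ib_ah_attr *av, u32 qpn)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);
            struct ipoib_ah *ah;

            ah = ipoib_create_ah(dev, priv->pd, av);   /* kref starts at 1 */
            if (!ah)
                    return;

            /* a cached user (e.g. a path entry) would kref_get(&ah->ref) */
            ipoib_send(dev, skb, ah, qpn);  /* stamps ah->last_send = tx_head */

            ipoib_put_ah(ah);       /* last put -> ipoib_free_ah(), which only
                                     * queues the AH on priv->dead_ahs */
    }

    /* ipoib_reap_ah() then runs every HZ and calls ib_destroy_ah() only
     * once ah->last_send <= priv->tx_tail, i.e. after every send that
     * referenced the AH has completed. */

This is why the PD-vs-AH ordering above matters: ib_dealloc_pd() can only succeed after the reaper has drained dead_ahs, which ipoib_ib_dev_cleanup() now waits for.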
From halr at voltaire.com Thu Nov 11 09:44:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 12:44:23 -0500 Subject: [openib-general] [PATCH] Remove use of SPIN_LOCK_UNLOCKED In-Reply-To: <521xf0z51h.fsf@topspin.com> References: <521xf0z51h.fsf@topspin.com> Message-ID: <1100195062.28921.24.camel@hpc-1> On Thu, 2004-11-11 at 12:09, Roland Dreier wrote: > In the upstream kernel, the use of SPIN_LOCK_UNLOCKED is being > phased out (look for changesets like "Lock initializer unifying"). > This patch converts the MAD layer to use spin_lock_init() instead, > please apply. Thanks. Applied. -- Hal From tduffy at sun.com Thu Nov 11 09:44:34 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:44:34 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52wtwsxpvx.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> Message-ID: <1100195074.25996.14.camel@duffman> On Thu, 2004-11-11 at 09:21 -0800, Roland Dreier wrote: > Tom/Nitin, can you guys tell me if the latest IPoIB code works with > IBSRM without any workarounds? I think the multicast group joining > and creating should be spec compliant now but I'd like to make sure Yes, this is working. (awesome!) As long as I only try to bring up the ib0.8001 interface. If I bring up ib0, ib_ipoib freaks out and continuously prints (very rapidly): ib0: multicast join failed for ff12:401b:7fff:0:0:0:ffff:ffff, status -22 I think this is an issue with IBSRM because pkey 7fff does not exist. I don't know whose fault this is. IBSRM continuously prints: mcast: smc_mcast_process_add_request: Could not add member: status 0x600 mcast: smc_mcast_check_new_group: Required components not set in comp_mask: required 0x00000000000130c6, set 0x0000000000010083 - mgid ff12401b7fff0000:00000000ffffffff mcast: smc_mcast_add_member: Could not verify attributes for new group: status 0x600 Bringing down ib0 stops the barrage. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 09:47:11 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:47:11 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <1100195074.25996.14.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 09:44:34 -0800") References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> Message-ID: <52k6ssxops.fsf@topspin.com> Tom> As long as I only try to bring up the ib0.8001 interface. If Tom> I bring up ib0, ib_ipoib freaks out and continuously prints Tom> (very rapidly): Tom> ib0: multicast join failed for ff12:401b:7fff:0:0:0:ffff:ffff, status -22 Hmm, looks like the backoff code isn't working properly (this should only happen every 16 seconds or so). I'll try to figure out what's going on here. Thanks for testing. - R. From roland at topspin.com Thu Nov 11 09:52:17 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 09:52:17 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100181660.14334.548.camel@trinity> (Matt Leininger's message of "Thu, 11 Nov 2004 06:01:00 -0800") References: <1100181660.14334.548.camel@trinity> Message-ID: <52fz3gxoha.fsf@topspin.com> Matt> The FAQ and a few other items are still a work in progress. A couple of suggestions for the FAQ: in "How do I submit source code patches?" 
I suggest adding something like "Please make sure that patches are licensed under the same terms as the original code (dual GPL/BSD for most of the OpenIB stack)." in "What version of the Linux kernel do you support?" I suggest changing the answer to something like OpenIB supports the latest 2.6 kernel (currently 2.6.9). in "What are all these upper layer protocols like IPoIB, DAPL, MPI, SDP, SRP, and others?" add a link to the IETF ipoib WG at - R. From tduffy at sun.com Thu Nov 11 09:55:27 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 09:55:27 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52fz3gxoha.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> Message-ID: <1100195727.25996.20.camel@duffman> On Thu, 2004-11-11 at 09:52 -0800, Roland Dreier wrote: > in "What are all these upper layer protocols like IPoIB, DAPL, MPI, SDP, > SRP, and others?" > > add a link to the IETF ipoib WG at Maybe also worth mentioning that only IPoIB is supported at this time. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Nitin.Hande at Sun.COM Thu Nov 11 09:58:45 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 09:58:45 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface Message-ID: <4193A855.5030102@Sun.COM> signed off by: Nitin Hande I would appreciate if someone can review my patch to enable inet6 address on ib interface. This is the first cut, will like to hear from all. I plan to setup a bugzilla account and append this patch to the bug that Tom has created for inet6. diff -Nurp -X dontdiff /build1/nitin/linux/linux-2.6.9/net/ipv6/addrconf.c linux-2.6.9/net/ipv6/addrconf.c --- /build1/nitin/linux/linux-2.6.9/net/ipv6/addrconf.c 2004-11-10 14:43:53.568970000 -0800 +++ linux-2.6.9/net/ipv6/addrconf.c 2004-11-10 15:07:40.196227944 -0800 @@ -1110,6 +1110,13 @@ static int ipv6_generate_eui64(u8 *eui, memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ + if (dev->addr_len != 20) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] ^= 2; + return 0; } return -1; } @@ -1809,6 +1816,7 @@ static void addrconf_dev_config(struct n if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; -------------------------------------------- Usage and output: Playing with link local address: ================================ sins-stinger-8:~/ipoibcfg/src # ifconfig ib0.8001 inet6 up sins-stinger-8:~/ipoibcfg/src # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-02-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.100.107 Bcast:192.168.100.255 Mask:255.255.255.0 inet6 addr: fe80::202:c901:976:1f81/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:31 errors:0 dropped:0 overruns:0 frame:0 TX packets:40 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:2832 (2.7 Kb) TX bytes:3532 (3.4 Kb) sins-stinger-8:~/ipoibcfg/src # ip -6 addr show 1: lo: mtu 16436 inet6 ::1/128 scope host valid_lft forever preferred_lft forever 7: ib0.8001: mtu 2044 qlen 128 inet6 fe80::202:c901:976:1f81/64 scope link valid_lft forever preferred_lft forever sins-stinger-8:~/ipoibcfg/src # route -A inet6 Kernel IPv6 routing table Destination Next Hop Flags Metric Ref Use Iface ::1/128 :: U 0 33 2 lo fe80::202:c901:976:1f81/128 :: U 0 9 2 lo fe80::202:c901:976:5161/128 fe80::202:c901:976:5161 UC 0 2 0 ib0.8001 fe80::/64 :: U 256 0 0 ib0.8001 ff00::/8 :: U 256 0 0 ib0.8001 sins-stinger-8:~/ipoibcfg/src # ping6 -I ib0.8001 fe80::202:c901:976:5161 PING fe80::202:c901:976:5161(fe80::202:c901:976:5161) from fe80::202:c901:976:1f81 ib0.8001: 56 data bytes 64 bytes from fe80::202:c901:976:5161: icmp_seq=1 ttl=64 time=2.77 ms 64 bytes from fe80::202:c901:976:5161: icmp_seq=2 ttl=64 time=0.067 ms 64 bytes from fe80::202:c901:976:5161: icmp_seq=3 ttl=64 time=0.066 ms ------------------------------------------------------------ global address and ssh test ================================ sins-stinger-8:~ # ifconfig ib0.8001 inet6 add 2222::2/64 sins-stinger-8:~ # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::2/64 Scope:Global inet6 addr: fe80::202:c901:976:1f81/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:549 errors:0 dropped:0 overruns:0 frame:0 TX packets:174 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:41510 (40.5 Kb) TX bytes:25115 (24.5 Kb) sins-stinger-8:~/ipoibcfg/src # ssh 2222::1 The authenticity of host '2222::1 (2222::1)' can't be established. RSA key fingerprint is c5:47:5d:44:85:09:a9:b5:38:d7:48:78:f0:77:30:eb. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added '2222::1' (RSA) to the list of known hosts. 
Password: Last login: Thu Nov 11 09:36:10 2004 from sr1-umpk-04.sfbay.sun.com sins-stinger-04:~ # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::1/64 Scope:Global inet6 addr: fe80::202:c901:976:5161/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:703 errors:0 dropped:0 overruns:0 frame:0 TX packets:652 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:72617 (70.9 Kb) TX bytes:66817 (65.2 Kb) ----------------------------------------------- Interoperability between Solaris and Linux: ============================================== sins-stinger-04:~/ipoibcfg/src # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::1/64 Scope:Global inet6 addr: fe80::202:c901:976:5161/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:726 errors:0 dropped:0 overruns:0 frame:0 TX packets:668 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:74673 (72.9 Kb) TX bytes:69193 (67.5 Kb) sins-stinger-04:~/ipoibcfg/src # uname -a Linux sins-stinger-04 2.6.9 #4 SMP Tue Nov 9 20:25:28 PST 2004 x86_64 x86_64 x86_64 GNU/Linux sins-stinger-04:~/ipoibcfg/src # ping6 -I ib0.8001 fe80::202:c901:976:5b01 PING fe80::202:c901:976:5b01(fe80::202:c901:976:5b01) from fe80::202:c901:976:5161 ib0.8001: 56 data bytes 64 bytes from fe80::202:c901:976:5b01: icmp_seq=1 ttl=255 time=0.401 ms 64 bytes from fe80::202:c901:976:5b01: icmp_seq=2 ttl=255 time=0.228 ms 64 bytes from fe80::202:c901:976:5b01: icmp_seq=3 ttl=255 time=0.237 ms --- fe80::202:c901:976:5b01 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.228/0.288/0.401/0.081 ms root at caseate# ifconfig ibd1 inet6 ibd1: flags=2000841 mtu 2044 index 4 inet6 fe80::202:c901:976:5b01/10 root at caseate# root at caseate# uname -a SunOS caseate.SFBay.Sun.COM 5.10 s10_70 sun4u sparc SUNW,Sun-Fire-280R root at caseate# ping fe80::202:c901:976:5161 fe80::202:c901:976:5161 is alive root at caseate# IThanks Nitin From roland at topspin.com Thu Nov 11 10:11:43 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 10:11:43 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <4193A855.5030102@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 09:58:45 -0800") References: <4193A855.5030102@Sun.COM> Message-ID: <52bre4xnkw.fsf@topspin.com> Nitin> I would appreciate if someone can review my patch to enable Nitin> inet6 address on ib interface. This is the first cut, will Nitin> like to hear from all. I plan to setup a bugzilla account Nitin> and append this patch to the bug that Tom has created for Nitin> inet6. This looks right to me. My only questions are: + eui[0] ^= 2; I remember some discussion about whether IBTA GUIDs are already modified EUI-64 or not. Is this the correct transformation or should we be doing something like "eui[0] |= 2;" (ie assume the universal bit should always be set in our IPv6 address)? What does S10 do here? Do we need to add an ipv6_ib_mc_map() function and call it in ndisc.c? Also, does the IPoIB driver need any modification to use IPv6 multicast groups correctly? Obviously IPv6 is working for you -- are ND packets being sent to the IPv4 broadcast group? If it's OK with you, I'll check in this patch as linux-2.6.9-ipoib-ipv6.diff. 
Thanks, Roland From roland at topspin.com Thu Nov 11 10:15:17 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 10:15:17 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <4193A855.5030102@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 09:58:45 -0800") References: <4193A855.5030102@Sun.COM> Message-ID: <527josxney.fsf@topspin.com> signed off by: Nitin Hande By the way, the proper format for signed off by: Nitin Hande is really Signed-off-by: Nitin Hande (see Documentation/SubmittingPatches). - R. From halr at voltaire.com Thu Nov 11 10:14:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 13:14:37 -0500 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52fz3gxoha.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> Message-ID: <1100196877.3283.120.camel@localhost.localdomain> On Thu, 2004-11-11 at 12:52, Roland Dreier wrote: > in "What version of the Linux kernel do you support?" > > I suggest changing the answer to something like OpenIB > supports the latest 2.6 kernel (currently 2.6.9). Not indicating the current version (2.6.9) makes for less frequent web page updates. Is just saying latest 2.6 kernel sufficient ? -- Hal From halr at voltaire.com Thu Nov 11 10:27:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 13:27:22 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4xnkw.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> Message-ID: <1100197642.3283.131.camel@localhost.localdomain> On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > My only questions are: > > + eui[0] ^= 2; > > I remember some discussion about whether IBTA GUIDs are already > modified EUI-64 or not. Is this the correct transformation or should > we be doing something like "eui[0] |= 2;" (ie assume the universal bit > should always be set in our IPv6 address)? IBTA GUIDs are EUI-64. The only issue I recall was whether the polarity of the U/G bit was consistent with IEEE. This was updated at IBA 1.2. It now says "manufacturer assigns EUI-64 with global scope set. May also assign additional EUI-64 with local scope." > What does S10 do here? What's S10 ? > Do we need to add an ipv6_ib_mc_map() function and call it in ndisc.c? This is needed. IPv6 multicast mapping is slightly different from the IPv4 mapping. -- Hal From tduffy at sun.com Thu Nov 11 10:35:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 10:35:23 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100197642.3283.131.camel@localhost.localdomain> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> <1100197642.3283.131.camel@localhost.localdomain> Message-ID: <1100198123.25996.33.camel@duffman> On Thu, 2004-11-11 at 13:27 -0500, Hal Rosenstock wrote: > What's S10 ? Solaris 10. Which has IPv6oIB. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Thu Nov 11 10:39:58 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 10:39:58 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100196877.3283.120.camel@localhost.localdomain> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> Message-ID: <1100198398.25996.35.camel@duffman> On Thu, 2004-11-11 at 13:14 -0500, Hal Rosenstock wrote: > Not indicating the current version (2.6.9) makes for less frequent web > page updates. Is just saying latest 2.6 kernel sufficient ? How about making the FAQ a WIKI :-) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 11 10:46:21 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 10:46:21 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100197642.3283.131.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 13:27:22 -0500") References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> <1100197642.3283.131.camel@localhost.localdomain> Message-ID: <52sm7gw7eq.fsf@topspin.com> Hal> IBTA GUIDs are EUI-64. The only issue I recall was whether Hal> the polarity of the U/G bit was consistent with IEEE. This Hal> was updated at IBA 1.2. It now says "manufacturer assigns Hal> EUI-64 with global scope set. May also assign additional Hal> EUI-64 with local scope." Uh-oh -- none of the HCAs I have access to have the universal bit set in their port GUIDs. - R. From iod00d at hp.com Thu Nov 11 10:47:15 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 11 Nov 2004 10:47:15 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100196877.3283.120.camel@localhost.localdomain> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> Message-ID: <20041111184715.GE32218@cup.hp.com> On Thu, Nov 11, 2004 at 01:14:37PM -0500, Hal Rosenstock wrote: > Not indicating the current version (2.6.9) makes for less frequent web > page updates. Is just saying latest 2.6 kernel sufficient ? Probably not since SLES9-ia64 is based on 2.6.5 and it won't work as-is. Making ithe FAQ a wiki (tduffy) is a good idea. grant From halr at voltaire.com Thu Nov 11 10:50:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 13:50:28 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52sm7gw7eq.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> <1100197642.3283.131.camel@localhost.localdomain> <52sm7gw7eq.fsf@topspin.com> Message-ID: <1100199028.3283.157.camel@localhost.localdomain> On Thu, 2004-11-11 at 13:46, Roland Dreier wrote: > Hal> IBTA GUIDs are EUI-64. The only issue I recall was whether > Hal> the polarity of the U/G bit was consistent with IEEE. This > Hal> was updated at IBA 1.2. It now says "manufacturer assigns > Hal> EUI-64 with global scope set. May also assign additional > Hal> EUI-64 with local scope." > > Uh-oh -- none of the HCAs I have access to have the universal bit set > in their port GUIDs. That's the old way (where old < IBA 1.2). I can dig out more emails on this and any recommendations. 
In the older versions of IBA, the bit was inverted due to some language ambiguity. It was supposed to be global. I would think we want to be compliant with the IBA 1.2 definition but if there are practical matters with this... -- Hal From Nitin.Hande at Sun.COM Thu Nov 11 11:02:36 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 11:02:36 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4xnkw.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52bre4xnkw.fsf@topspin.com> Message-ID: <4193B74C.2060408@Sun.COM> All, Thanks for your comments, Roland Dreier wrote: > Nitin> I would appreciate if someone can review my patch to enable > Nitin> inet6 address on ib interface. This is the first cut, will > Nitin> like to hear from all. I plan to setup a bugzilla account > Nitin> and append this patch to the bug that Tom has created for > Nitin> inet6. > > This looks right to me. My only questions are: > > + eui[0] ^= 2; > > I remember some discussion about whether IBTA GUIDs are already > modified EUI-64 or not. Is this the correct transformation or should > we be doing something like "eui[0] |= 2;" (ie assume the universal bit > should always be set in our IPv6 address)? What does S10 do here? Yes, I see S10 setting the bit as eur[0] |= 2. I will update that in my patch. > > Do we need to add an ipv6_ib_mc_map() function and call it in ndisc.c? yes, I will code that function and send a new patch including the comments received so far... > > Also, does the IPoIB driver need any modification to use IPv6 > multicast groups correctly? > > Obviously IPv6 is working for you -- are ND packets being sent to the > IPv4 broadcast group? Yes. > > If it's OK with you, I'll check in this patch as linux-2.6.9-ipoib-ipv6.diff. Let me update the patch and if it looks okay, you can then go ahead. Hope that is fine.... Thanks Nitin > > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Thu Nov 11 11:31:53 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 11:31:53 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <4193A855.5030102@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 09:58:45 -0800") References: <4193A855.5030102@Sun.COM> Message-ID: <52oei4w5au.fsf@topspin.com> I just tested, and the IPv6 ND packets are being sent to the MGID ff12:401b:ffff:0:0:0:ffff:ffff. This makes sense because net/ipv6/ndisc.c uses dev->broadcast in ndisc_mc_map() if it doesn't know about the interface type. I'll see if creating ipv6_ib_mc_map() helps. - R. From roland at topspin.com Thu Nov 11 12:13:42 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 12:13:42 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52oei4w5au.fsf@topspin.com> (Roland Dreier's message of "Thu, 11 Nov 2004 11:31:53 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> Message-ID: <52bre4w3d5.fsf@topspin.com> OK, with the patch below all the correct IPv6 groups seem to be created and used. Ping works at least... One question about IPv6 and IPoIB: currently the IPoIB driver joins the IPv4 broadcast group and then uses those parameters to join or create (as needed) the other groups, including all IPv6 multicast groups. 
Is this correct, or is there a distinguished IPv6 MCG that is supposed to be used as a base as the IPv4 broadcast group is? Thanks, Roland Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-2.6.9/include/net/if_inet6.h =================================================================== --- linux-2.6.9.orig/include/net/if_inet6.h 2004-10-18 14:55:28.000000000 -0700 +++ linux-2.6.9/include/net/if_inet6.h 2004-11-11 11:38:20.000000000 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-2.6.9/net/ipv6/addrconf.c =================================================================== --- linux-2.6.9.orig/net/ipv6/addrconf.c 2004-10-18 14:55:24.000000000 -0700 +++ linux-2.6.9/net/ipv6/addrconf.c 2004-11-11 11:35:23.000000000 -0800 @@ -1110,6 +1110,13 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ + if (dev->addr_len != 20) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1809,6 +1816,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. */ return; Index: linux-2.6.9/net/ipv6/ndisc.c =================================================================== --- linux-2.6.9.orig/net/ipv6/ndisc.c 2004-10-18 14:54:32.000000000 -0700 +++ linux-2.6.9/net/ipv6/ndisc.c 2004-11-11 11:35:50.000000000 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Thu Nov 11 13:02:55 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 13:02:55 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <1100195074.25996.14.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 09:44:34 -0800") References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> Message-ID: <52y8h8umio.fsf@topspin.com> Tom> As long as I only try to bring up the ib0.8001 interface. If Tom> I bring up ib0, ib_ipoib freaks out and continuously prints Tom> (very rapidly): OK, I think I fixed this. When you get a chance to retest, try bringing up ib0 as see if it still acts freaky. Thanks, Roland From halr at voltaire.com Thu Nov 11 13:01:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 16:01:23 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4w3d5.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> Message-ID: <1100206883.3283.211.camel@localhost.localdomain> On Thu, 2004-11-11 at 15:13, Roland Dreier wrote: > One question about IPv6 and IPoIB: currently the IPoIB driver joins > the IPv4 broadcast group and then uses those parameters to join or > create (as needed) the other groups, including all IPv6 multicast > groups. 
Is this correct, or is there a distinguished IPv6 MCG that is > supposed to be used as a base as the IPv4 broadcast group is? There are no broadcast addresses in IPv6, their function being superseded by multicast addresses. Neighbor Solicitation messages are multicast to the solicited-node multicast address of the target address. So I don't think there is a "master" IPv6 group. The IPoIB I-D does say that all group parameters should come from the broadcast group. But that's an IPv4 group so I'm not sure about IPv6 as a node could have an IPv6 interface but no IPv4 interface. -- Hal From roland at topspin.com Thu Nov 11 13:09:17 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 13:09:17 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4w3d5.fsf@topspin.com> (Roland Dreier's message of "Thu, 11 Nov 2004 12:13:42 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> Message-ID: <52u0rwum82.fsf@topspin.com> By the way, can anyone explain the following to me (an IPv6 rookie): # ping6 -I ib0 fe80::202:c901:78c:e461 PING fe80::202:c901:78c:e461(fe80::202:c901:78c:e461) from fe80::202:c901:7fc:c711 ib0: 56 data bytes 64 bytes from fe80::202:c901:78c:e461: icmp_seq=1 ttl=64 time=32.2 ms 64 bytes from fe80::202:c901:78c:e461: icmp_seq=2 ttl=64 time=14.7 ms 64 bytes from fe80::202:c901:78c:e461: icmp_seq=3 ttl=64 time=14.6 ms --- fe80::202:c901:78c:e461 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2001ms rtt min/avg/max/mdev = 14.682/20.557/32.274/8.286 ms # ping6 fe80::202:c901:78c:e461 connect: Invalid argument # ssh -6 fe80::202:c901:78c:e461 ssh: connect to host fe80::202:c901:78c:e461 port 22: Invalid argument ssh works fine if I assign non-autoconfig'ed addresses. Ethernet behaves the same way so I don't think it's something to do with the IPoIB driver, but I would like to understand it better (if only for my own edification). Thanks, Roland From Nitin.Hande at Sun.COM Thu Nov 11 14:03:13 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 14:03:13 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52u0rwum82.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> Message-ID: <4193E1A1.1060202@Sun.COM> Roland Dreier wrote: > By the way, can anyone explain the following to me (an IPv6 rookie): > > # ping6 -I ib0 fe80::202:c901:78c:e461 > PING fe80::202:c901:78c:e461(fe80::202:c901:78c:e461) from fe80::202:c901:7fc:c711 ib0: 56 data bytes > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=1 ttl=64 time=32.2 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=2 ttl=64 time=14.7 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=3 ttl=64 time=14.6 ms > > --- fe80::202:c901:78c:e461 ping statistics --- > 3 packets transmitted, 3 received, 0% packet loss, time 2001ms > rtt min/avg/max/mdev = 14.682/20.557/32.274/8.286 ms > > # ping6 fe80::202:c901:78c:e461 > connect: Invalid argument In order to ping link local address you need to specify an outgoing interface. Thats mentioned in man ping. - I interface address Set source address to specified interface address. Argument may be numeric IP address or name of device.When pinging IPv6 link-local address this option is required. 
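At the sockets level, the failure Roland saw is the kernel rejecting a link-local destination that carries no scope. A sketch (standard RFC 3493 API, assumed here -- not code from ping6 itself) of the field that -I fills in:

    #include <arpa/inet.h>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <string.h>

    static void fill_linklocal_dest(struct sockaddr_in6 *dst)
    {
            memset(dst, 0, sizeof *dst);
            dst->sin6_family = AF_INET6;
            inet_pton(AF_INET6, "fe80::202:c901:78c:e461", &dst->sin6_addr);
            /* fe80::/10 is ambiguous across links; with sin6_scope_id left
             * at 0, connect() fails with EINVAL ("Invalid argument"). */
            dst->sin6_scope_id = if_nametoindex("ib0");
    }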
> > # ssh -6 fe80::202:c901:78c:e461 > ssh: connect to host fe80::202:c901:78c:e461 port 22: Invalid argument > > ssh works fine if I assign non-autoconfig'ed addresses. > > Ethernet behaves the same way so I don't think it's something to do > with the IPoIB driver, but I would like to understand it better (if > only for my own edification). On Solaris I see ssh just working fine on auto-config'ed address. Need more time to understand linux code. Thanks Nitin > > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Tom.Duffy at Sun.COM Thu Nov 11 14:12:24 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 11 Nov 2004 14:12:24 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52y8h8umio.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> Message-ID: <1100211144.25996.55.camel@duffman> On Thu, 2004-11-11 at 13:02 -0800, Roland Dreier wrote: > Tom> As long as I only try to bring up the ib0.8001 interface. If > Tom> I bring up ib0, ib_ipoib freaks out and continuously prints > Tom> (very rapidly): > > OK, I think I fixed this. When you get a chance to retest, try > bringing up ib0 as see if it still acts freaky. Yes, this is fixed now. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Tom.Duffy at Sun.COM Thu Nov 11 14:14:50 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 11 Nov 2004 14:14:50 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52y8h8umio.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> Message-ID: <1100211290.25996.58.camel@duffman> On Thu, 2004-11-11 at 13:02 -0800, Roland Dreier wrote: > OK, I think I fixed this. When you get a chance to retest, try > bringing up ib0 as see if it still acts freaky. Oops. Spoke too soon. It seems `ifconfig ib0 down` now hangs. Can't [ctrl]-c it either. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Nov 11 14:21:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 17:21:59 -0500 Subject: [Fwd: Re: [openib-general] [PATCH] Enable inet6 on ib interface] Message-ID: <1100211719.3283.238.camel@localhost.localdomain> Here's some text from the IPoIB I-D relative to this: [AARCH] requires the interface identifier be created in the "Modified EUI-64" format when derived from an EUI-64 identifier. [IBTA] is unclear if the GUID should use IEEE EUI-64 format or the "Modified EUI-64" format. Therefore, when creating an interface identifier from the GUID an implementation MUST do the following: => Determine if the GUID is a modified EUI-64 identifier ("u" bit is toggled) as defined by [AARCH] => If the GUID is a modified EUI-64 identifier then the "u" bit MUST NOT be toggled when creating the interface identifier => If the GUID is an umodified EUI-64 identifier then the "u" bit MUST be toggled in compliance with [AARCH] I'm not sure how one determines whether the GUID is modified or unmodified EUI-64. 
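The two transformations being debated, side by side as a sketch; the toggle rule is from RFC 2373 / [AARCH], and which variant applies to IB GUIDs is exactly the open question in this thread:

    #include <linux/string.h>
    #include <linux/types.h>

    /* RFC 2373: for a true (unmodified) EUI-64, build the IPv6 interface
     * identifier by inverting the universal/local bit, 0x02 of byte 0. */
    static void eui64_to_ifid(const u8 *guid, u8 *ifid)
    {
            memcpy(ifid, guid, 8);
            ifid[0] ^= 2;   /* Nitin's first patch: toggle u/L */
    }

    /* The alternative adopted later in the thread (and what Solaris 10
     * does) is to force the bit set -- 'eui[0] |= 2' -- since pre-1.2
     * HCAs ship GUIDs with the universal bit clear. */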
Here's an email from the LWG chair to the IPoIB WG back on August 9: [Ipoverib] Update on status of eui-64 in IB ________________________________________________________________________ * To: ipoverib at ietf.org * Subject: [Ipoverib] Update on status of eui-64 in IB * From: Daniel Cassiday * Date: Mon, 09 Aug 2004 17:28:42 -0400 * List-help: * List-id: IP over InfiniBand WG Discussion List * List-post: * List-subscribe: , * List-unsubscribe: , * Reply-to: Daniel.Cassiday at Sun.COM * Sender: ipoverib-bounces at ietf.org * User-agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0 ________________________________________________________________________ A while back it was pointed out that the IB specification was unclear on how to set the universal/local bit in the EUI-64. This was causing a problem in the ipoverib wg on how to generate an interface identifier from this EUI-64. The IBTA has looked into this and planning is to modify the IB spec to clarify that the universal/local bit should be cleared when defining the EUI-64. The spec with this modification is currently under internal review. Pending approval (which is expected) the clarification will be included in the upcoming 1.2 release of the spec. This means that the IBA will conform to the IEEE definition of universal/local bit, and that for ipoverib, interface identifiers should be generated from the EUI-64 as per RFC 2373 (i.e. the universal/local bit should be inverted). (Note, at one point the IBTA Link WG considered using a special value in the OUI field (i.e. this is where the vendor id appears) to indicate local scope but this was discarded in favor of the simplier fix defined above.) _______________________________________________ IPoverIB mailing list IPoverIB at ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib -----Forwarded Message----- From: Hal Rosenstock To: Roland Dreier Cc: Nitin Hande , openib-general at openib.org Subject: Re: [openib-general] [PATCH] Enable inet6 on ib interface Date: 11 Nov 2004 13:50:28 -0500 On Thu, 2004-11-11 at 13:46, Roland Dreier wrote: > Hal> IBTA GUIDs are EUI-64. The only issue I recall was whether > Hal> the polarity of the U/G bit was consistent with IEEE. This > Hal> was updated at IBA 1.2. It now says "manufacturer assigns > Hal> EUI-64 with global scope set. May also assign additional > Hal> EUI-64 with local scope." > > Uh-oh -- none of the HCAs I have access to have the universal bit set > in their port GUIDs. That's the old way (where old < IBA 1.2). I can dig out more emails on this and any recommendations. In the older versions of IBA, the bit was inverted due to some language ambiguity. It was supposed to be global. I would think we want to be compliant with the IBA 1.2 definition but if there are practical matters with this... -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Nov 11 14:42:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 14:42:39 -0800 Subject: [openib-general] QP error handling Message-ID: <4193EADF.50001@ichips.intel.com> I'm trying to force errors on QP0/1 to see if my changes can recover from them. I force the errors by sending with an invalid lkey. Based on the implementation of mthca, what can be expected? I'm not seeing the QP event handler get invoked. 
I do receive a completion error, followed by flushed work requests. Attempts to modify the QP directly to RTS fail -- I was hoping that the QP would enter SQE state. - Sean From roland at topspin.com Thu Nov 11 14:46:45 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 14:46:45 -0800 Subject: [openib-general] Re: QP error handling In-Reply-To: <4193EADF.50001@ichips.intel.com> (Sean Hefty's message of "Thu, 11 Nov 2004 14:42:39 -0800") References: <4193EADF.50001@ichips.intel.com> Message-ID: <52ekj0uhpm.fsf@topspin.com> Sean> I'm trying to force errors on QP0/1 to see if my changes can Sean> recover from them. I force the errors by sending with an Sean> invalid lkey. Based on the implementation of mthca, what Sean> can be expected? Sean> I'm not seeing the QP event handler get invoked. I do Sean> receive a completion error, followed by flushed work Sean> requests. Attempts to modify the QP directly to RTS fail -- Sean> I was hoping that the QP would enter SQE state. mthca currently doesn't handle these 'asynchronous' state transitions (ie transition to error). It continues to think the QP is in the RTS state. Proper handling needs to be implemented. However should there be a QP event for a send with invalid L_Key? I would have thought the failed completion entry would be all the consumer gets. - R. From mshefty at ichips.intel.com Thu Nov 11 14:58:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 14:58:05 -0800 Subject: [openib-general] Re: QP error handling In-Reply-To: <52ekj0uhpm.fsf@topspin.com> References: <4193EADF.50001@ichips.intel.com> <52ekj0uhpm.fsf@topspin.com> Message-ID: <4193EE7D.6030800@ichips.intel.com> Roland Dreier wrote: > mthca currently doesn't handle these 'asynchronous' state transitions > (ie transition to error). It continues to think the QP is in the RTS > state. Proper handling needs to be implemented. Ok - thanks for the info. > However should there be a QP event for a send with invalid L_Key? I > would have thought the failed completion entry would be all the > consumer gets. I don't think an async event is necessary. I was working off the failed completion entry, but when the modify_qp call failed, I was trying to determine if the QP was going into the error state (which would disallow the transition) by checking for a callback to the async event handler. - Sean From roland at topspin.com Thu Nov 11 15:01:15 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 15:01:15 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <1100211290.25996.58.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 14:14:50 -0800") References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> <1100211290.25996.58.camel@duffman> Message-ID: <52d5ykuh1g.fsf@topspin.com> Tom> Oops. Spoke too soon. It seems `ifconfig ib0 down` now hangs. I think this should fix it (already checked in). - R. 
Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1213) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -379,6 +379,8 @@ if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + mcast->query = NULL; + down(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { if (status == -ETIMEDOUT) From Tom.Duffy at Sun.COM Thu Nov 11 15:31:10 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 11 Nov 2004 15:31:10 -0800 Subject: [openib-general] IPoIB w/ IBSRM? In-Reply-To: <52d5ykuh1g.fsf@topspin.com> References: <52wtwsxpvx.fsf@topspin.com> <1100195074.25996.14.camel@duffman> <52y8h8umio.fsf@topspin.com> <1100211290.25996.58.camel@duffman> <52d5ykuh1g.fsf@topspin.com> Message-ID: <1100215870.25996.64.camel@duffman> On Thu, 2004-11-11 at 15:01 -0800, Roland Dreier wrote: > Tom> Oops. Spoke too soon. It seems `ifconfig ib0 down` now hangs. > > I think this should fix it (already checked in). Yuppers. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Nov 11 15:56:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 18:56:34 -0500 Subject: [openib-general] Link Width Active Message-ID: <1100217394.3369.2.camel@localhost.localdomain> Hi, Is there a way to display PortInfo components other than PortState ? For example, LinkWidthActive might be useful (as might some others). I couldn't find it in /sys/class/infiniband/mthca0/port/1. Thanks. -- Hal From tduffy at sun.com Thu Nov 11 16:07:04 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 16:07:04 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52bre4w3d5.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> Message-ID: <1100218024.25996.73.camel@duffman> On Thu, 2004-11-11 at 12:13 -0800, Roland Dreier wrote: > OK, with the patch below all the correct IPv6 groups seem to be > created and used. Ping works at least... With the updated patch (and with Nitin's original patch), when I bring up ipv6, I am not getting the correct link local address. I can assign it a global address and ping just fine, but the lower 64 bits of the IPv6 address are NULL (except for the set link local 2 (mentioned earlier)): ib0.8001 Link encap:UNSPEC HWaddr 00-01-00-14-00-00-00-00-00-00-00-00-00-00-00-00 inet6 addr: 2222::2/64 Scope:Global inet6 addr: fe80::200:0:0:0/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:15 errors:0 dropped:0 overruns:0 frame:0 TX packets:18 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:1512 (1.4 Kb) TX bytes:1752 (1.7 Kb) Any ideas? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Nitin.Hande at Sun.COM Thu Nov 11 16:16:20 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Thu, 11 Nov 2004 16:16:20 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52u0rwum82.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> Message-ID: <419400D4.2060900@Sun.COM> Roland Dreier wrote: > By the way, can anyone explain the following to me (an IPv6 rookie): > > # ping6 -I ib0 fe80::202:c901:78c:e461 > PING fe80::202:c901:78c:e461(fe80::202:c901:78c:e461) from fe80::202:c901:7fc:c711 ib0: 56 data bytes > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=1 ttl=64 time=32.2 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=2 ttl=64 time=14.7 ms > 64 bytes from fe80::202:c901:78c:e461: icmp_seq=3 ttl=64 time=14.6 ms > > --- fe80::202:c901:78c:e461 ping statistics --- > 3 packets transmitted, 3 received, 0% packet loss, time 2001ms > rtt min/avg/max/mdev = 14.682/20.557/32.274/8.286 ms > > # ping6 fe80::202:c901:78c:e461 > connect: Invalid argument > > # ssh -6 fe80::202:c901:78c:e461 > ssh: connect to host fe80::202:c901:78c:e461 port 22: Invalid argument > > ssh works fine if I assign non-autoconfig'ed addresses. Allright, so looking further more now I can get ssh working, the sytax is very peculiar for linux sins-stinger-04:/etc/ssh # uname -a Linux sins-stinger-04 2.6.9 #5 SMP Thu Nov 11 12:54:00 PST 2004 x86_64 x86_64 x86_64 GNU/Linux sins-stinger-04:/etc/ssh # ssh fe80::209:3dff:fe00:4766%eth1 The authenticity of host 'fe80::209:3dff:fe00:4766%eth1 (fe80::209:3dff:fe00:4766%eth1)' can't be established. RSA key fingerprint is ce:f5:ea:82:2a:42:a2:f9:e0:01:ba:ef:63:3c:cb:2a. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'fe80::209:3dff:fe00:4766%eth1' (RSA) to the list of known hosts. Password: Last login: Thu Nov 11 17:15:01 2004 from sr1-umpk-04.sfbay.sun.com sins-stinger-8:~ # uname -a Linux sins-stinger-8 2.6.9 #9 SMP Wed Nov 10 09:42:29 PST 2004 x86_64 x86_64 x86_64 GNU/Linux sins-stinger-8:~ # Based on some googling I found that for linux, since Link Local addresses are not routable, you need to provide the scope (by specifying an outgoing interface) to ssh in linux. This is very different from Solaris implementation where it still derives the scope of link local address and thereby its outgoing interface too. Does that sounds okay ? Thanks Nitin > > Ethernet behaves the same way so I don't think it's something to do > with the IPoIB driver, but I would like to understand it better (if > only for my own edification). > > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Thu Nov 11 16:18:21 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:18:21 -0800 Subject: [openib-general] Re: Link Width Active In-Reply-To: <1100217394.3369.2.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 18:56:34 -0500") References: <1100217394.3369.2.camel@localhost.localdomain> Message-ID: <523bzfvs1e.fsf@topspin.com> Hal> Hi, Is there a way to display PortInfo components other than Hal> PortState ? 
For example, LinkWidthActive might be useful (as Hal> might some others). I couldn't find it in Hal> /sys/class/infiniband/mthca0/port/1. Sure, we just need to add more attributes in core/sysfs.c. LinkWidthActive would require processing a PortInfo MAD (PortState comes from ib_port_query()) but it's not too much work to implement. - R. From roland at topspin.com Thu Nov 11 16:21:59 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:21:59 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100218024.25996.73.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 16:07:04 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> Message-ID: <52y8h7udaw.fsf@topspin.com> Tom> With the updated patch (and with Nitin's original patch), Tom> when I bring up ipv6, I am not getting the correct link local Tom> address. I can assign it a global address and ping just Tom> fine, but the lower 64 bits of the IPv6 address are NULL Tom> (except for the set link local 2 (mentioned earlier)): Tom> Any ideas? Yup, looks like the device addr for child interfaces isn't being set correctly; compare: # ip addr show ib0 13: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:04:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:07:8c:e4:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vs. # ip addr show ib0.8001 15: ib0.8001: mtu 2044 qdisc noop qlen 128 link/[32] 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:80:01:00:00:00:00:00:00:ff:ff:ff:ff I should have a patch fairly soon. - R. From roland at topspin.com Thu Nov 11 16:38:57 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:38:57 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52y8h7udaw.fsf@topspin.com> (Roland Dreier's message of "Thu, 11 Nov 2004 16:21:59 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> Message-ID: <52u0rvucim.fsf@topspin.com> I think we just need to copy our address to the child interface. This patch seems to fix it for me (already checked in). (By the way, how does IPv6 handle autoconfig for VLAN interfaces? With this change you can get duplicate autoconfig'ed addresses, although they will be in different partitions. I'm not sure if this causes any problems...) - R. 
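The all-zero link-layer address is also where the fe80::200:0:0:0 address seen earlier comes from: for an ARPHRD_INFINIBAND interface, IPv6 autoconfiguration derives the link-local interface ID from the port GUID in the last 8 bytes of the 20-byte hardware address. A minimal sketch of that derivation (the helper name is made up and this is not the exact addrconf code):

/*
 * Sketch: derive the modified EUI-64 interface ID from an IPoIB
 * hardware address (assumes kernel types; not the exact addrconf
 * code). The 20-byte address is 4 bytes of flags/QPN followed by
 * the 16-byte port GID; the low 8 bytes of the GID are the GUID.
 */
#include <linux/types.h>
#include <linux/string.h>

static void ipoib_addr_to_ifid(const u8 *dev_addr, u8 *eui)
{
	memcpy(eui, dev_addr + 12, 8);	/* low 8 bytes of the GID */
	eui[0] ^= 2;			/* flip the universal/local bit */
}

Applied to the ib0 address above (GUID 00:02:c9:01:07:8c:e4:61) this gives fe80::202:c901:78c:e461, and applied to the all-zero ib0.8001 address it gives exactly the fe80::200:0:0:0 reported earlier.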
Index: infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- infiniband/ulp/ipoib/ipoib_vlan.c (revision 1212) +++ infiniband/ulp/ipoib/ipoib_vlan.c (working copy) @@ -74,6 +74,7 @@ priv->pkey = pkey; + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, IPOIB_HW_ADDR_LEN); priv->dev->broadcast[8] = pkey >> 8; priv->dev->broadcast[9] = pkey & 0xff; From roland at topspin.com Thu Nov 11 16:46:37 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 16:46:37 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <419400D4.2060900@Sun.COM> (Nitin Hande's message of "Thu, 11 Nov 2004 16:16:20 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> <419400D4.2060900@Sun.COM> Message-ID: <52pt2juc5u.fsf@topspin.com> Nitin> Based on some googling I found that for linux, since Link Nitin> Local addresses are not routable, you need to provide the Nitin> scope (by specifying an outgoing interface) to ssh in Nitin> linux. This is very different from Solaris implementation Nitin> where it still derives the scope of link local address and Nitin> thereby its outgoing interface too. Does that sounds okay ? Thanks, that works for me. Not very intuitive but I guess it makes sense. In fact I don't see how Solaris can deduce the interface from a link local IPv6 address... - R. From tduffy at sun.com Thu Nov 11 17:40:53 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 11 Nov 2004 17:40:53 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52u0rvucim.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> <52u0rvucim.fsf@topspin.com> Message-ID: <1100223653.24741.3.camel@duffman> On Thu, 2004-11-11 at 16:38 -0800, Roland Dreier wrote: > I think we just need to copy our address to the child interface. This > patch seems to fix it for me (already checked in). Yup, this fixes it. You rock. > (By the way, how does IPv6 handle autoconfig for VLAN interfaces? > With this change you can get duplicate autoconfig'ed addresses, > although they will be in different partitions. I'm not sure if this > causes any problems...) Would you really bring both interfaces up? If this is a problem, the spec should have the pkey be part of the link local address. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Thu Nov 11 17:41:22 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 17:41:22 -0800 Subject: [openib-general] [PATCH] [1/2] SQE handling on MAD QPs Message-ID: <419414C2.4090300@ichips.intel.com> This patch recovers from send queue errors on QP 0/1. (It should also "work" in the case of fatal errors, but does not try to recover.) Code was tested by forcing send errors and checking that the port could still go to active. Patch can be applied separately from patch to mthca, but requires other patch to work properly. 
- Sean Index: mad.c =================================================================== --- mad.c (revision 1209) +++ mad.c (working copy) @@ -90,6 +90,8 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port. */ @@ -591,6 +593,7 @@ /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. ud.timeout_ms); + mad_send_wr->retry = 0; /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -1339,6 +1342,70 @@ } } +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) { + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup. + */ + return; + } + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests. + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send. */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + /* Transition QP to RTS and fail offending send.
*/ + ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - unable to " + "transition QP to RTS : %d\n", ret); + ib_mad_send_done_handler(port_priv, wc); + mark_sends_for_retry(qp_info); + } +} + /* * IB MAD completion callback */ @@ -1346,34 +1413,25 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; - struct ib_mad_list_head *mad_list; - struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { - if (wc.status != IB_WC_SUCCESS) { - /* Determine if failure was a send or receive */ - mad_list = (struct ib_mad_list_head *) - (unsigned long)wc.wr_id; - qp_info = mad_list->mad_queue->qp_info; - if (mad_list->mad_queue == &qp_info->send_queue) - wc.opcode = IB_WC_SEND; - else - wc.opcode = IB_WC_RECV; - } - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - BUG_ON(1); - break; - } + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); } } @@ -1717,7 +1775,8 @@ /* * Modify QP into Ready-To-Send state */ -static inline int ib_mad_change_qp_state_to_rts(struct ib_qp *qp) +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state) { int ret; struct ib_qp_attr *attr; @@ -1729,11 +1788,12 @@ "ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RTS; - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; - + attr_mask = IB_QP_STATE; + if (cur_state == IB_QPS_RTR) { + attr->sq_psn = IB_MAD_SEND_Q_PSN; + attr_mask |= IB_QP_SQ_PSN; + } ret = ib_modify_qp(qp, attr, attr_mask); kfree(attr); @@ -1793,7 +1853,8 @@ goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, + IB_QPS_RTR); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1852,6 +1913,15 @@ } } +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + static void init_mad_queue(struct ib_mad_qp_info *qp_info, struct ib_mad_queue *mad_queue) { @@ -1884,6 +1954,8 @@ qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); if (IS_ERR(qp_info->qp)) { printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1209) +++ mad_priv.h (working copy) @@ -127,6 +127,7 @@ u64 wr_id; /* client WR ID */ u64 tid; unsigned long timeout; + int retry; int refcount; enum ib_wc_status status; }; From mshefty at ichips.intel.com Thu Nov 11 17:45:22 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Nov 2004 17:45:22 -0800 Subject: [openib-general] [PATCH] [2/2] change QP state to SQE Message-ID: <419415B2.3060907@ichips.intel.com> This should transition the QP state to SQE when encountering a send error on the CQ. There may be a better way of doing this; I didn't spend a lot of time studying the code. - Sean Index: mthca_dev.h =================================================================== --- mthca_dev.h (revision 1209) +++ mthca_dev.h (working copy) @@ -311,6 +311,7 @@ void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); +void mthca_qp_send_error(struct mthca_qp *qp); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); Index: mthca_cq.c =================================================================== --- mthca_cq.c (revision 1209) +++ mthca_cq.c (working copy) @@ -330,6 +330,9 @@ break; } + if (cqe->syndrome != SYNDROME_WR_FLUSH_ERR && is_send) + mthca_qp_send_error(qp); + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); if (err) return err; Index: mthca_qp.c =================================================================== --- mthca_qp.c (revision 1209) +++ mthca_qp.c (working copy) @@ -288,6 +288,12 @@ wake_up(&qp->wait); } +void mthca_qp_send_error(struct mthca_qp *qp) +{ + if (qp->state == IB_QPS_RTS) + qp->state = IB_QPS_SQE; +} + static int to_mthca_state(enum ib_qp_state ib_state) { switch (ib_state) { From roland at topspin.com Thu Nov 11 18:09:58 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 18:09:58 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <1100223653.24741.3.camel@duffman> (Tom Duffy's message of "Thu, 11 Nov 2004 17:40:53 -0800") References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> <52u0rvucim.fsf@topspin.com> <1100223653.24741.3.camel@duffman> Message-ID: <52lld7u8ax.fsf@topspin.com> Tom> Would you really bring both interfaces up? If this is a Tom> problem, the spec should have the pkey be part of the link Tom> local address. It actually seems to work fine to bring up multiple IPv6 interfaces that end up with the same link local address (like ib0 and ib0.8001). The fact that Linux forces you to specify an interface when using a link local address comes to the rescue. And I can't think of any issues, since different partitions are really pretty disjoint. - R. 
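For what it's worth, the interface that Linux insists on shows up in the sockets API as sin6_scope_id (the %eth1 suffix that ssh accepted is just another spelling of it). A minimal userspace sketch, reusing the link-local address and ib0 interface from this thread:

/*
 * Sketch: connect to an IPv6 link-local peer on Linux. Without an
 * explicit scope (sin6_scope_id, or a %ib0 suffix on the address),
 * connect() fails with EINVAL -- the "Invalid argument" seen
 * earlier in this thread.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 dst;
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	memset(&dst, 0, sizeof dst);
	dst.sin6_family = AF_INET6;
	dst.sin6_port = htons(22);
	inet_pton(AF_INET6, "fe80::202:c901:78c:e461", &dst.sin6_addr);
	dst.sin6_scope_id = if_nametoindex("ib0");	/* pick the interface */

	if (connect(fd, (struct sockaddr *) &dst, sizeof dst))
		perror("connect");
	close(fd);
	return 0;
}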
From roland at topspin.com Thu Nov 11 18:11:08 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 18:11:08 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <419415B2.3060907@ichips.intel.com> (Sean Hefty's message of "Thu, 11 Nov 2004 17:45:22 -0800") References: <419415B2.3060907@ichips.intel.com> Message-ID: <52hdnvu88z.fsf@topspin.com> Sean> This should transition the QP state to SQE when encountering Sean> a send error on the CQ. There may be a better way of doing Sean> this; I didn't spend a lot of time studying the code. Thanks for the patch... let me look at how I want to do this (and probably handle transitions to ERR while I'm at it). - R. From halr at voltaire.com Thu Nov 11 20:00:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 23:00:30 -0500 Subject: [openib-general] PD dealloc and AH busy problems remain Message-ID: <1100232030.3369.50.camel@localhost.localdomain> Hi, Don't know what the proper expectation is (whether the change below meant that the PD dealloc problem and the AH busy problem should be gone), r1211 | roland | 2004-11-11 15:36:46 -0500 (Thu, 11 Nov 2004) | 1 line Move final reap of AHs to a more correct location but they are not (just in case you thought they should be). The PD dealloc problem is now intermittent on IPoIB module removal (an improvement). AH busy on mthca module removal is still regular. If the message in the log didn't mean this, ignore this. -- Hal From roland at topspin.com Thu Nov 11 20:41:23 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 20:41:23 -0800 Subject: [openib-general] Re: PD dealloc and AH busy problems remain In-Reply-To: <1100232030.3369.50.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 23:00:30 -0500") References: <1100232030.3369.50.camel@localhost.localdomain> Message-ID: <52d5yju1ak.fsf@topspin.com> Hal> but they are not (just in case you thought they should be). The Hal> PD dealloc problem is now intermittent on IPoIB module removal Hal> (an improvement). AH busy on mthca module removal is still Hal> regular. Thanks for pointing this out; it was the kick in the rear I needed to really investigate this. It turns out there were two bugs (I think). In any case my logs are clean with these changes. - R.
Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1212) +++ infiniband/core/sa_query.c (working copy) @@ -632,7 +632,7 @@ } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); -static void send_handler(struct ib_mad_agent *mad_agent, +static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { struct ib_sa_query *query; @@ -660,6 +660,12 @@ break; } + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + PCI_DMA_TODEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + query->release(query); spin_lock_irqsave(&idr_lock, flags); Index: infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 1217) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -210,7 +210,7 @@ { struct ipoib_dev_priv *priv = netdev_priv(dev); - if (priv->qp != NULL) { + if (priv->qp) { if (ib_destroy_qp(priv->qp)) ipoib_warn(priv, "ib_qp_destroy failed\n"); Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1212) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -306,10 +306,6 @@ if (status) goto err; - ah = kmalloc(sizeof *ah, GFP_KERNEL); - if (!ah) - goto err; - { struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), @@ -320,13 +316,11 @@ .port_num = priv->port }; - ah->ah = ib_create_ah(priv->pd, &av); + ah = ipoib_create_ah(skb->dev, priv->pd, &av); } - if (IS_ERR(ah->ah)) { - kfree(ah); + if (!ah) goto err; - } *(struct ipoib_ah **) skb->cb = ah; @@ -459,13 +453,17 @@ return 0; } - if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", skb->dst ? "neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + return 0; + } /* put the pseudoheader back on */ skb_push(skb, sizeof *phdr); Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1216) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -48,7 +48,8 @@ if (IS_ERR(ah->ah)) { kfree(ah); ah = NULL; - } + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); return ah; } @@ -61,7 +62,12 @@ unsigned long flags; spin_lock_irqsave(&priv->lock, flags); - list_add_tail(&ah->list, &priv->dead_ahs); + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else + list_add_tail(&ah->list, &priv->dead_ahs); spin_unlock_irqrestore(&priv->lock, flags); } From halr at voltaire.com Thu Nov 11 20:54:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 11 Nov 2004 23:54:34 -0500 Subject: [openib-general] IPv6 MGID formation question Message-ID: <1100235273.3369.65.camel@localhost.localdomain> It looks to me like the MGIDs for IPv6 are only getting the low 32 bits of the address rather than 80 bits. 
|   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
+--------+----+----+-----------------+---------+-------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID     |
+--------+----+----+-----------------+---------+-------------------+
Local interface address: inet6 addr: fe80::208:f104:396:71/64 Scope:Link MGID is displayed as MGID ff12:601b:ffff:0:0:1:ff96:71 (Haven't looked on the IB wire yet). IPv6 comes up in IPv6 over IPv4 tunneling mode but I don't think this should affect the MGID used. I have the latest bits (and patches) installed. -- Hal From roland at topspin.com Thu Nov 11 21:18:10 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 11 Nov 2004 21:18:10 -0800 Subject: [openib-general] Re: IPv6 MGID formation question In-Reply-To: <1100235273.3369.65.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 11 Nov 2004 23:54:34 -0500") References: <1100235273.3369.65.camel@localhost.localdomain> Message-ID: <528y97tzl9.fsf@topspin.com> Hal> It looks to me like the MGIDs for IPv6 are only getting the Hal> low 32 bits of the address rather than 80 bits. I think your setup is fine: Hal> Local interface address: inet6 addr: fe80::208:f104:396:71/64 The IPv6 solicited-node multicast address corresponding to this address is ff02:0:0:0:0:1:ff96:71. The ND code will join this group when the interface is brought up. Hal> MGID is displayed as MGID ff12:601b:ffff:0:0:1:ff96:71 This is the correct MGID for that solicited-node address. (If you're actually sending to other IPv6 multicast addresses and getting the wrong MGID then something is screwy. It's hard to think of what could be wrong with our ipv6_ib_mc_map() function though...) - R. From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 21:20:28 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 21:20:28 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52ekj0z93b.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52ekj0z93b.fsf@topspin.com> Message-ID: <1100236828.3722.560.camel@trinity> On Thu, 2004-11-11 at 07:41 -0800, Roland Dreier wrote: > Matt> As some of you may have noticed, we migrated over to the > Matt> new OpenIB web pages yesterday. The FAQ and a few other > Matt> items are still a work in progress. Let me know if there > Matt> are any errors or if folks have other feedback/suggestions. > > Looks great. One suggestion: under news, it's probably worth linking > to or mentioning the PathForward funding announcement. > I didn't have time to add this before another day of SC04 started. The PathForward announcements are now links under News. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 21:33:19 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 21:33:19 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <52fz3gxoha.fsf@topspin.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> Message-ID: <1100237599.14334.564.camel@trinity> On Thu, 2004-11-11 at 09:52 -0800, Roland Dreier wrote: > Matt> The FAQ and a few other items are still a work in progress. > > A couple of suggestions for the FAQ: > > in "How do I submit source code patches?" > > I suggest adding something like "Please make sure that patches are > licensed under the same terms as the original code (dual GPL/BSD > for most of the OpenIB stack)." > > in "What version of the Linux kernel do you support?" > > I suggest changing the answer to something like OpenIB > supports the latest 2.6 kernel (currently 2.6.9).
> > in "What are all these upper layer protocols like IPoIB, DAPL, MPI, SDP, > SRP, and others?" > > add a link to the IETF ipoib WG at > Done. Thanks. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 21:33:48 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 21:33:48 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100196877.3283.120.camel@localhost.localdomain> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> Message-ID: <1100237628.14336.566.camel@trinity> On Thu, 2004-11-11 at 13:14 -0500, Hal Rosenstock wrote: > On Thu, 2004-11-11 at 12:52, Roland Dreier wrote: > > in "What version of the Linux kernel do you support?" > > > > I suggest changing the answer to something like OpenIB > > supports the latest 2.6 kernel (currently 2.6.9). > > Not indicating the current version (2.6.9) makes for less frequent web > page updates. Is just saying latest 2.6 kernel sufficient ? > I don't mind keeping it updated. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Nov 11 22:03:07 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 11 Nov 2004 22:03:07 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <20041111184715.GE32218@cup.hp.com> References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> <20041111184715.GE32218@cup.hp.com> Message-ID: <1100239387.3722.581.camel@trinity> On Thu, 2004-11-11 at 10:47 -0800, Grant Grundler wrote: > On Thu, Nov 11, 2004 at 01:14:37PM -0500, Hal Rosenstock wrote: > > Not indicating the current version (2.6.9) makes for less frequent web > > page updates. Is just saying latest 2.6 kernel sufficient ? > > Probably not since SLES9-ia64 is based on 2.6.5 and it won't work as-is. > Making ithe FAQ a wiki (tduffy) is a good idea. > FAQ wiki does sound good. I'll look into it. - Matt From itoumsn at nttdata.co.jp Thu Nov 11 22:14:21 2004 From: itoumsn at nttdata.co.jp (Masanori ITOH) Date: Fri, 12 Nov 2004 15:14:21 +0900 (JST) Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Message-ID: <20041112.151421.120503395.itoumsn@nttdata.co.jp> Hello folks, As I mentioned fomerly on this list, I have a working u/kDAPL on top of the gen1 stack and I've finally finished all internal procedures to make it public. # Actually, it took me about one month and a half. Sigh... :( I would like to put that into the OpenIB contributors area (Somewhere like 'https://openib.org/svn/trunk/contrib/nttdata/'.), and could anyone tell me how I can do that? Thanks in advance, Masanori --- Masanori ITOH Open Source Software Development Center, NTT DATA CORPORATION e-mail: itoumsn at nttdata.co.jp phone : +81-3-3523-8122 (ext. 172-7199) From halr at voltaire.com Fri Nov 12 06:55:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 09:55:42 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <419414C2.4090300@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> Message-ID: <1100271340.6671.1.camel@hpc-1> On Thu, 2004-11-11 at 20:41, Sean Hefty wrote: > This patch recovers from send queue errors on QP 0/1. (It should also "work" in the case > of fatal errors, but does not try to recover.) Code was tested by forcing send errors and > checking that the port could still go to active. > > Patch can be applied separately from patch to mthca, but requires other patch to work > properly. 
I am having difficulty applying this patch. For some reason, all the changes are rejected. Could this be a patch version issue ? My version of patch is 2.5.4. Should I upgrade and try ? -- Hal From Nitin.Hande at Sun.COM Fri Nov 12 07:44:04 2004 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Fri, 12 Nov 2004 07:44:04 -0800 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52lld7u8ax.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <1100218024.25996.73.camel@duffman> <52y8h7udaw.fsf@topspin.com> <52u0rvucim.fsf@topspin.com> <1100223653.24741.3.camel@duffman> <52lld7u8ax.fsf@topspin.com> Message-ID: <4194DA44.5060003@Sun.COM> Roland Dreier wrote: > Tom> Would you really bring both interfaces up? If this is a > Tom> problem, the spec should have the pkey be part of the link > Tom> local address. > > It actually seems to work fine to bring up multiple IPv6 interfaces > that end up with the same link local address (like ib0 and ib0.8001). > The fact that Linux forces you to specify an interface when using a > link local address comes to the rescue. And I can't think of any > issues, since different partitions are really pretty disjoint. Btw, on VLANs, I know that the VLAN IDs are a part of the link-local addresses. That way all the link-local addresses are unique, and as a result they join different solicited-node multicast groups during the DAD process. Thanks Nitin > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Fri Nov 12 08:05:32 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 08:05:32 -0800 Subject: [openib-general] New OpenIB webpages In-Reply-To: <1100239387.3722.581.camel@trinity> (Matt Leininger's message of "Thu, 11 Nov 2004 22:03:07 -0800") References: <1100181660.14334.548.camel@trinity> <52fz3gxoha.fsf@topspin.com> <1100196877.3283.120.camel@localhost.localdomain> <20041111184715.GE32218@cup.hp.com> <1100239387.3722.581.camel@trinity> Message-ID: <52oei3rr1v.fsf@topspin.com> Matt> FAQ wiki does sound good. I'll look into it. In general having a wiki would be great (there have been a few times in the past where I would have liked to have been able to create a quick wiki page). - R. From halr at voltaire.com Fri Nov 12 08:06:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 11:06:14 -0500 Subject: [openib-general] Re: PD dealloc and AH busy problems remain In-Reply-To: <52d5yju1ak.fsf@topspin.com> References: <1100232030.3369.50.camel@localhost.localdomain> <52d5yju1ak.fsf@topspin.com> Message-ID: <1100275574.3369.419.camel@localhost.localdomain> On Thu, 2004-11-11 at 23:41, Roland Dreier wrote: > Thanks for pointing this out; it was the kick in the rear I needed to > really investigate this. It turns out there were two bugs (I think). > In any case my logs are clean with these changes. So are mine now :-) I'll keep an eye out for this recurring, but otherwise assume this is fixed. Thanks.
-- Hal From halr at voltaire.com Fri Nov 12 08:16:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 11:16:15 -0500 Subject: [openib-general] [PATCH] Enable inet6 on ib interface In-Reply-To: <52pt2juc5u.fsf@topspin.com> References: <4193A855.5030102@Sun.COM> <52oei4w5au.fsf@topspin.com> <52bre4w3d5.fsf@topspin.com> <52u0rwum82.fsf@topspin.com> <419400D4.2060900@Sun.COM> <52pt2juc5u.fsf@topspin.com> Message-ID: <1100276175.3369.440.camel@localhost.localdomain> On Thu, 2004-11-11 at 19:46, Roland Dreier wrote: > In fact I don't see how Solaris can deduce the interface from > a link local IPv6 address... I don't see how this would work either (at least for Linux): Here's my config: eth1 inet6 addr: fe80::230:48ff:fe27:212f/64 Scope:Link ib0 inet6 addr: fe80::208:f104:396:71/64 Scope:Link ip -6 route show fe80::/64 dev eth1 metric 256 mtu 1500 advmss 1440 fe80::/64 dev ib0 metric 256 mtu 2044 advmss 1984 ff00::/8 dev eth1 metric 256 mtu 1500 advmss 1440 ff00::/8 dev ib0 metric 256 mtu 2044 advmss 1984 So it looks like some help is needed to select the outgoing local interface. It's not just a routing calculation on the destination address, as it appears to be in Solaris. -- Hal From mshefty at ichips.intel.com Fri Nov 12 09:13:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:13:35 -0800 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <1100271340.6671.1.camel@hpc-1> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> Message-ID: <4194EF3F.80608@ichips.intel.com> Hal Rosenstock wrote: > On Thu, 2004-11-11 at 20:41, Sean Hefty wrote: > >>This patch recovers from send queue errors on QP 0/1. (It should also "work" in the case >>of fatal errors, but does not try to recover.) Code was tested by forcing send errors and >>checking that the port could still go to active. >> >>Patch can be applied separately from patch to mthca, but requires other patch to work >>properly. > > > I am having difficulty applying this patch. For some reason, all the > changes are rejected. Could this be a patch version issue ? My version > of patch is 2.5.4. Should I upgrade and try ? Not sure what the issue is. Let me make sure that I've pulled the latest code and resubmit the patch. - Sean From mshefty at ichips.intel.com Fri Nov 12 09:15:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:15:56 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <52hdnvu88z.fsf@topspin.com> References: <419415B2.3060907@ichips.intel.com> <52hdnvu88z.fsf@topspin.com> Message-ID: <4194EFCC.4070802@ichips.intel.com> Roland Dreier wrote: > Sean> This should transition the QP state to SQE when encountering > Sean> a send error on the CQ. There may be a better way of doing > Sean> this; I didn't spend a lot of time studying the code. > > Thanks for the patch... let me look at how I want to do this (and > probably handle transitions to ERR while I'm at it). That's fine. This was just the easiest change that I could find in order to test my mad changes.
- Sean From halr at voltaire.com Fri Nov 12 09:18:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 12:18:32 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <4194EF3F.80608@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> Message-ID: <1100279912.3369.507.camel@localhost.localdomain> On Fri, 2004-11-12 at 12:13, Sean Hefty wrote: > Not sure what the issue is. Let me make sure that I've pulled the latest code and > resubmit the patch. It looks right to me. Does it work for you ? Can you send a normal rather than unified diff ? -- Hal From mshefty at ichips.intel.com Fri Nov 12 09:21:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:21:50 -0800 Subject: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length In-Reply-To: <1100107204.2836.36.camel@hpc-1> References: <1100033372.17687.3.camel@hpc-1> <52bre6692h.fsf@topspin.com> <527jou68xy.fsf@topspin.com> <1100033742.2170.11.camel@localhost.localdomain> <52llda4m00.fsf@topspin.com> <1100057166.17621.23.camel@hpc-1> <52r7n22tgt.fsf@topspin.com> <52ekj22qow.fsf@topspin.com> <1100096891.801.25.camel@hpc-1> <41924806.8060509@ichips.intel.com> <52wtwt1vxx.fsf@topspin.com> <1100107204.2836.36.camel@hpc-1> Message-ID: <4194F12E.9040205@ichips.intel.com> Hal Rosenstock wrote: > On Wed, 2004-11-10 at 11:59, Roland Dreier wrote: > >> Sean> What exactly does it mean then when process_mad returns >> Sean> success? Do any of the return bits from process_mad >> Sean> indicate that the MAD was for the HCA driver? >> >>SUCCESS means that process_mad didn't encounter any errors. If REPLY >>or CONSUMED is set then process_mad actually handled the packet. > > > I would assume that REPLY and CONSUMED are also mutually exclusive. I believe that's the case, but maybe it would make more sense if they weren't, and let CONSUMED indicate that MAD was for the HCA driver. From an API perspective, I think we only need to know if the HCA driver intercepted the MAD, and if so, was a reply generated. - Sean From roland at topspin.com Fri Nov 12 09:41:55 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 09:41:55 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <419415B2.3060907@ichips.intel.com> (Sean Hefty's message of "Thu, 11 Nov 2004 17:45:22 -0800") References: <419415B2.3060907@ichips.intel.com> Message-ID: <52bre3rml8.fsf@topspin.com> I thought about this a little, and it seems that having the CQ poll operation update the QP state is not the right solution. It seems it would be better to add support for the "Current QP state" modifier for the modify QP operation and expect the consumer to use that to indicate that the QP is in SQE state. - R. From mshefty at ichips.intel.com Fri Nov 12 09:52:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:52:06 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <52bre3rml8.fsf@topspin.com> References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> Message-ID: <4194F846.5030703@ichips.intel.com> Roland Dreier wrote: > I thought about this a little, and it seems that having the CQ poll > operation update the QP state is not the right solution. 
It seems it > would be better to add support for the "Current QP state" modifier for > the modify QP operation and expect the consumer to use that to > indicate that the QP is in SQE state. That would work fine, and be only a minor update to the MAD code. Will you be generating a patch for mthca? - Sean From mshefty at ichips.intel.com Fri Nov 12 09:54:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 09:54:51 -0800 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <1100279912.3369.507.camel@localhost.localdomain> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> <1100279912.3369.507.camel@localhost.localdomain> Message-ID: <20041112095451.206ce08c.mshefty@ichips.intel.com> On Fri, 12 Nov 2004 12:18:32 -0500 Hal Rosenstock wrote: > On Fri, 2004-11-12 at 12:13, Sean Hefty wrote: > > Not sure what the issue is. Let me make sure that I've pulled the latest code and > > resubmit the patch. > > It looks right to me. Does it work for you ? Can you send a normal > rather than unified diff ? Can you try this version? I'll also revert back to the original code and see if I can apply the patch. - Sean Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1221) +++ include/ib_mad.h (working copy) @@ -250,6 +250,8 @@ * @mad_agent - Specifies the associated registration to post the send to. * @send_wr - Specifies the information needed to send the MAD(s). * @bad_send_wr - Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. */ int ib_post_send_mad(struct ib_mad_agent *mad_agent, struct ib_send_wr *send_wr, Index: core/mad.c =================================================================== --- core/mad.c (revision 1221) +++ core/mad.c (working copy) @@ -90,6 +90,8 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port. */ @@ -591,6 +593,7 @@ /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr.
ud.timeout_ms); + mad_send_wr->retry = 0; /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -1339,6 +1342,70 @@ } } +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) { + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup. + */ + return; + } + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests. + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send. */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + /* Transition QP to RTS and fail offending send.
*/ + ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - unable to " + "transition QP to RTS : %d\n", ret); + ib_mad_send_done_handler(port_priv, wc); + mark_sends_for_retry(qp_info); + } +} + /* * IB MAD completion callback */ @@ -1346,34 +1413,25 @@ { struct ib_mad_port_private *port_priv; struct ib_wc wc; - struct ib_mad_list_head *mad_list; - struct ib_mad_qp_info *qp_info; port_priv = (struct ib_mad_port_private*)data; ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { - if (wc.status != IB_WC_SUCCESS) { - /* Determine if failure was a send or receive */ - mad_list = (struct ib_mad_list_head *) - (unsigned long)wc.wr_id; - qp_info = mad_list->mad_queue->qp_info; - if (mad_list->mad_queue == &qp_info->send_queue) - wc.opcode = IB_WC_SEND; - else - wc.opcode = IB_WC_RECV; - } - switch (wc.opcode) { - case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); - break; - case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); - break; - default: - BUG_ON(1); - break; - } + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); } } @@ -1717,7 +1775,8 @@ /* * Modify QP into Ready-To-Send state */ -static inline int ib_mad_change_qp_state_to_rts(struct ib_qp *qp) +static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, + enum ib_qp_state cur_state) { int ret; struct ib_qp_attr *attr; @@ -1729,11 +1788,12 @@ "ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RTS; - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; - + attr_mask = IB_QP_STATE; + if (cur_state == IB_QPS_RTR) { + attr->sq_psn = IB_MAD_SEND_Q_PSN; + attr_mask |= IB_QP_SQ_PSN; + } ret = ib_modify_qp(qp, attr, attr_mask); kfree(attr); @@ -1793,7 +1853,8 @@ goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp); + ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, + IB_QPS_RTR); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS\n", i); @@ -1852,6 +1913,15 @@ } } +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + static void init_mad_queue(struct ib_mad_qp_info *qp_info, struct ib_mad_queue *mad_queue) { @@ -1884,6 +1954,8 @@ qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; qp_init_attr.port_num = port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); if (IS_ERR(qp_info->qp)) { printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1221) +++ core/mad_priv.h (working copy) @@ -127,6 +127,7 @@ u64 wr_id; /* client WR ID */ u64 tid; unsigned long timeout; + int retry; int refcount; enum ib_wc_status status; }; From halr at voltaire.com Fri Nov 12 10:04:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 13:04:24 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <20041112095451.206ce08c.mshefty@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> <1100279912.3369.507.camel@localhost.localdomain> <20041112095451.206ce08c.mshefty@ichips.intel.com> Message-ID: <1100282664.3369.556.camel@localhost.localdomain> On Fri, 2004-11-12 at 12:54, Sean Hefty wrote: > On Fri, 12 Nov 2004 12:18:32 -0500 > Can you try this version? I'll also revert back to the original code and see if > I can apply the patch. Don't bother (if you haven't already). This patch worked. -- Hal From halr at voltaire.com Fri Nov 12 10:19:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 13:19:44 -0500 Subject: [openib-general] Re: [PATCH] [1/2] SQE handling on MAD QPs In-Reply-To: <20041112095451.206ce08c.mshefty@ichips.intel.com> References: <419414C2.4090300@ichips.intel.com> <1100271340.6671.1.camel@hpc-1> <4194EF3F.80608@ichips.intel.com> <1100279912.3369.507.camel@localhost.localdomain> <20041112095451.206ce08c.mshefty@ichips.intel.com> Message-ID: <1100283584.3369.573.camel@localhost.localdomain> On Fri, 2004-11-12 at 12:54, Sean Hefty wrote: > Can you try this version? Thanks. Applied. -- Hal From robert.j.woodruff at intel.com Fri Nov 12 11:44:23 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 12 Nov 2004 11:44:23 -0800 Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C2D8D8@orsmsx408> Hi Masanori, Matt Leininger from Sandia controls who has access to the svn tree. You should probably contact him about providing contributions. cheers woody -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Masanori ITOH Sent: Thursday, November 11, 2004 10:14 PM To: openib-general at openib.org Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Hello folks, As I mentioned formerly on this list, I have a working u/kDAPL on top of the gen1 stack and I've finally finished all internal procedures to make it public. # Actually, it took me about one month and a half. Sigh... :( I would like to put that into the OpenIB contributors area (Somewhere like 'https://openib.org/svn/trunk/contrib/nttdata/'.), and could anyone tell me how I can do that?
Thanks in advance, Masanori --- Masanori ITOH Open Source Software Development Center, NTT DATA CORPORATION e-mail: itoumsn at nttdata.co.jp phone : +81-3-3523-8122 (ext. 172-7199) _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Fri Nov 12 11:46:50 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 11:46:50 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <4194F846.5030703@ichips.intel.com> (Sean Hefty's message of "Fri, 12 Nov 2004 09:52:06 -0800") References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> Message-ID: <52vfcargt1.fsf@topspin.com> Sean> That would work fine, and be only a minor update to the MAD Sean> code. Will you be generating a patch for mthca? Yes, eventually. (ib_verbs.h will also need an update to add the field to ib_qp_attr) - R. From mshefty at ichips.intel.com Fri Nov 12 12:39:12 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 12:39:12 -0800 Subject: [openib-general] [PATCH] Remove unneeded call in MAD code Message-ID: <20041112123912.04171c0a.mshefty@ichips.intel.com> This patch removes ib_mad_return_posted_send_mads, which isn't needed when shutting down. There cannot be any sends outstanding at this point, or clients would still exist. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1222) +++ core/mad.c (working copy) @@ -1692,21 +1692,6 @@ } /* - * Return all the posted send MADs - */ -static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) -{ - unsigned long flags; - - /* Just clear port send posted MAD list... revisit!!! */ - spin_lock_irqsave(&qp_info->send_queue.lock, flags); - INIT_LIST_HEAD(&qp_info->send_queue.list); - qp_info->send_queue.count = 0; - INIT_LIST_HEAD(&qp_info->overflow_list); - spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); -} - -/* * Modify QP into Init state */ static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) @@ -1909,7 +1894,6 @@ i); } ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); } } From halr at voltaire.com Fri Nov 12 12:58:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 15:58:50 -0500 Subject: [openib-general] Re: [PATCH] Remove unneeded call in MAD code In-Reply-To: <20041112123912.04171c0a.mshefty@ichips.intel.com> References: <20041112123912.04171c0a.mshefty@ichips.intel.com> Message-ID: <1100293130.3369.658.camel@localhost.localdomain> On Fri, 2004-11-12 at 15:39, Sean Hefty wrote: > This patch removes ib_mad_return_posted_send_mads, which isn't needed when > shutting down. There cannot be any sends outstanding at this point, or > clients would still exist. Thanks. Applied. -- Hal From mshefty at ichips.intel.com Fri Nov 12 16:45:09 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 16:45:09 -0800 Subject: [openib-general] [PATCH] collapse MAD function calls Message-ID: <20041112164509.561e90de.mshefty@ichips.intel.com> This patch collapses several function calls into one when activating the MAD QPs. This avoids repeated allocation/freeing of memory; a condensed sketch of the resulting bring-up loop is below.
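In outline (error unwinding omitted, and the function name here is just for illustration; the real hunks are in the patch below):

/*
 * Condensed sketch of the collapsed bring-up: a single ib_qp_attr
 * is allocated up front and reused for the INIT -> RTR -> RTS
 * transitions of both MAD QPs (types come from mad_priv.h).
 */
static int ib_mad_qps_to_rts(struct ib_mad_port_private *port_priv)
{
	struct ib_qp_attr *attr;
	struct ib_qp *qp;
	int i, ret = 0;

	attr = kmalloc(sizeof *attr, GFP_KERNEL);
	if (!attr)
		return -ENOMEM;

	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
		qp = port_priv->qp_info[i].qp;

		/* PKey index for QP1 is irrelevant but one is needed
		 * for the Reset to Init transition. */
		attr->qp_state = IB_QPS_INIT;
		attr->pkey_index = 0;
		attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
		ret = ib_modify_qp(qp, attr, IB_QP_STATE |
				   IB_QP_PKEY_INDEX | IB_QP_QKEY);
		if (ret)
			break;

		attr->qp_state = IB_QPS_RTR;
		ret = ib_modify_qp(qp, attr, IB_QP_STATE);
		if (ret)
			break;

		attr->qp_state = IB_QPS_RTS;
		attr->sq_psn = IB_MAD_SEND_Q_PSN;
		ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN);
		if (ret)
			break;
	}

	kfree(attr);
	return ret;
}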
I have plans to examine the QP transitions to the reset state to see if these are necessary and if a race condition exists between shutting down a port and processing a receive completion. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1222) +++ core/mad.c (working copy) @@ -90,8 +90,6 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port @@ -1396,13 +1394,21 @@ } else ib_mad_send_done_handler(port_priv, wc); } else { + struct ib_qp_attr *attr; + /* Transition QP to RTS and fail offending send */ - ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); - if (ret) - printk(KERN_ERR PFX "mad_error_handler - unable to " - "transition QP to RTS : %d\n", ret); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + ret = ib_modify_qp(qp_info->qp, attr, IB_QP_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } ib_mad_send_done_handler(port_priv, wc); - mark_sends_for_retry(qp_info); } } @@ -1692,172 +1698,51 @@ } /* - * Return all the posted send MADs - */ -static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) -{ - unsigned long flags; - - /* Just clear port send posted MAD list... revisit!!! */ - spin_lock_irqsave(&qp_info->send_queue.lock, flags); - INIT_LIST_HEAD(&qp_info->send_queue.list); - qp_info->send_queue.count = 0; - INIT_LIST_HEAD(&qp_info->overflow_list); - spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); -} - -/* - * Modify QP into Init state - */ -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_INIT; - /* - * PKey index for QP1 is irrelevant but - * one is needed for the Reset to Init transition. 
- */ - attr->pkey_index = 0; - /* QKey is 0 for QP0 */ - if (qp->qp_num == 0) - attr->qkey = 0; - else - attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Receive state - */ -static inline int ib_mad_change_qp_state_to_rtr(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_RTR; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Send state - */ -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - attr->qp_state = IB_QPS_RTS; - attr_mask = IB_QP_STATE; - if (cur_state == IB_QPS_RTR) { - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask |= IB_QP_SQ_PSN; - } - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Reset state + * Start the port */ -static inline int ib_mad_change_qp_state_to_reset(struct ib_qp *qp) +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) { - int ret; + int ret, i; struct ib_qp_attr *attr; - int attr_mask; + struct ib_qp *qp; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " - "ret = %d\n", ret); - return ret; -} - -/* - * Start the port - */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) -{ - int ret, i, ret2; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition. + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "INIT\n", i); + "INIT: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTR\n", i); + "RTR: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, - IB_QPS_RTR); + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTS\n", i); + "RTS: %d\n", i, ret); goto error; } } @@ -1865,30 +1750,28 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { printk(KERN_ERR PFX "Failed to request completion " - "notification\n"); + "notification: %d\n", ret); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive " - "requests\n"); + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); goto error; } } - return 0; + goto out; error: for (i = 0; i < IB_MAD_QPS_CORE; i++) { + attr->qp_state = IB_QPS_RESET; + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ret2 = ib_mad_change_qp_state_to_reset(port_priv-> - qp_info[i].qp); - if (ret2) { - printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " - "change QP%d state to RESET\n", i); - } } + +out: + kfree(attr); return ret; } @@ -1898,19 +1781,26 @@ static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) { int i, ret; + struct ib_qp_attr *attr; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset( - port_priv->qp_info[i].qp); - if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" - " %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, - i); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RESET; + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, + IB_QP_STATE); + if (ret) + printk(KERN_ERR PFX "ib_mad_port_stop: " + "Couldn't change %s port %d QP%d " + "state to RESET\n", + port_priv->device->name, + port_priv->port_num, i); } - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ib_mad_return_posted_send_mads(&port_priv->qp_info[i]); + kfree(attr); } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); } static void qp_event_handler(struct ib_event *event, void *qp_context) From halr at voltaire.com Fri Nov 12 19:08:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 12 Nov 2004 22:08:14 -0500 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <20041112164509.561e90de.mshefty@ichips.intel.com> References: <20041112164509.561e90de.mshefty@ichips.intel.com> Message-ID: <1100315294.3369.682.camel@localhost.localdomain> On Fri, 2004-11-12 at 19:45, Sean Hefty wrote: > This patch callapses several function calls into one when activating > the MAD QPs. This avoids repeated allocation/freeing of memory. > > I have plans to examine the QP transitions to the reset > state to see if these are necessary and if a race condition exists > between shutting down a port and processing a receive completion. 
This patch looks like it includes the previous patch and due to this 2 large hunks are rejected. Can you regenerate this ? -- Hal From sean.hefty at intel.com Fri Nov 12 20:04:22 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 12 Nov 2004 20:04:22 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100315294.3369.682.camel@localhost.localdomain> Message-ID: >On Fri, 2004-11-12 at 19:45, Sean Hefty wrote: >> This patch callapses several function calls into one when activating >> the MAD QPs. This avoids repeated allocation/freeing of memory. >> >> I have plans to examine the QP transitions to the reset >> state to see if these are necessary and if a race condition exists >> between shutting down a port and processing a receive completion. > >This patch looks like it includes the previous patch and due to this 2 >large hunks are rejected. Can you regenerate this ? Oops, sorry about that. I'll do this as soon as I get back in touch with my systems. - Sean From roland at topspin.com Fri Nov 12 20:21:39 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 12 Nov 2004 20:21:39 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <4194F846.5030703@ichips.intel.com> (Sean Hefty's message of "Fri, 12 Nov 2004 09:52:06 -0800") References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> Message-ID: <52fz3eqsz0.fsf@topspin.com> OK, here's a patch that adds support for "Current QP state" in the modify QP verb. Does this look OK? Thanks, Roland Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1223) +++ infiniband/include/ib_verbs.h (working copy) @@ -421,7 +421,8 @@ enum ib_qp_attr_mask { IB_QP_STATE = 1, - IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<1), + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), IB_QP_ACCESS_FLAGS = (1<<3), IB_QP_PKEY_INDEX = (1<<4), IB_QP_PORT = (1<<5), @@ -460,6 +461,7 @@ struct ib_qp_attr { enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; enum ib_mtu path_mtu; enum ib_mig_state path_mig_state; u32 qkey; Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 1223) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -394,13 +394,16 @@ [MLX] = IB_QP_SQ_PSN, }, .opt_param = { - [UD] = IB_QP_QKEY, - [RC] = (IB_QP_ALT_PATH | + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_MIN_RNR_TIMER | IB_QP_PATH_MIG_STATE), - [MLX] = IB_QP_QKEY, + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), } } }, @@ -410,12 +413,14 @@ [IB_QPS_RTS] = { .trans = MTHCA_TRANS_RTS2RTS, .opt_param = { - [UD] = IB_QP_QKEY, + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), [RC] = (IB_QP_ACCESS_FLAGS | IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE | IB_QP_MIN_RNR_TIMER), - [MLX] = IB_QP_QKEY, + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), } }, [IB_QPS_SQD] = { @@ -427,9 +432,36 @@ [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, [IB_QPS_RTS] = { .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } }, [IB_QPS_SQD] = { .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + 
[RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } } }, [IB_QPS_SQE] = { @@ -437,6 +469,14 @@ [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, [IB_QPS_RTS] = { .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } } }, [IB_QPS_ERR] = { @@ -490,9 +530,19 @@ u8 status; int err; - spin_lock_irq(&qp->lock); - cur_state = qp->state; - spin_unlock_irq(&qp->lock); + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } if (attr_mask & IB_QP_STATE) { if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) From halr at voltaire.com Sat Nov 13 06:39:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 13 Nov 2004 09:39:42 -0500 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <52fz3eqsz0.fsf@topspin.com> References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> <52fz3eqsz0.fsf@topspin.com> Message-ID: <1100356781.3369.692.camel@localhost.localdomain> On Fri, 2004-11-12 at 23:21, Roland Dreier wrote: > OK, here's a patch that adds support for "Current QP state" in the > modify QP verb. Does this look OK? Looks good to me. A few comments/questions relative to IBA 1.2 vol 1 table 91 (p.569-572): For SQD2SQD, path migration state is missing as is remote node address vector, . Is IB_QP_TIMEOUT local ACK timeout ? Also, does MAX_QP_RD_ATOMIC handle both local and destination ? I presume the omission of number of WQEs is intentional. -- Hal From roland at topspin.com Sat Nov 13 15:07:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 13 Nov 2004 15:07:14 -0800 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE In-Reply-To: <1100356781.3369.692.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 13 Nov 2004 09:39:42 -0500") References: <419415B2.3060907@ichips.intel.com> <52bre3rml8.fsf@topspin.com> <4194F846.5030703@ichips.intel.com> <52fz3eqsz0.fsf@topspin.com> <1100356781.3369.692.camel@localhost.localdomain> Message-ID: <52y8h5pcv1.fsf@topspin.com> Hal> For SQD2SQD, path migration state is missing as is remote Hal> node address vector, . Good catch on the IB_QP_AV (actually IB_QP_PATH_MIG_STATE was there). Hal> Is IB_QP_TIMEOUT local ACK timeout ? Yes. Hal> Also, does MAX_QP_RD_ATOMIC handle both local and destination? No, actually there is also IB_QP_MAX_DEST_RD_ATOMIC. I need to audit where that's missing from my table (although no mthca RDMA support is written yet). Hal> I presume the omission of number of WQEs is intentional. Yes, I'm not planning to try and support resizing of QPs. 
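For reference, the consumer-side usage this enables looks roughly like the sketch below. The helper name is made up for illustration and error reporting is trimmed, but it mirrors what the MAD error handler wants to do when a send completes in error: push the QP back to RTS while telling the verb which state we believe the QP is currently in, rather than trusting a cached state.

	static int recover_qp_from_sqe(struct ib_qp *qp)
	{
		struct ib_qp_attr *attr;
		int ret;

		/* struct ib_qp_attr is large; allocate it rather
		   than putting it on the kernel stack */
		attr = kmalloc(sizeof *attr, GFP_KERNEL);
		if (!attr)
			return -ENOMEM;

		attr->qp_state     = IB_QPS_RTS;  /* state we want */
		attr->cur_qp_state = IB_QPS_SQE;  /* state we believe the QP is in */
		ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_CUR_STATE);

		kfree(attr);
		return ret;
	}

With IB_QP_CUR_STATE set, modify QP takes the current state from the attribute instead of the driver's cached qp->state, which is exactly the SQE/SQD ambiguity this mask bit is meant to resolve.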
Thanks, Roland From gdror at mellanox.co.il Sun Nov 14 15:15:30 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Mon, 15 Nov 2004 01:15:30 +0200 Subject: [openib-general] Re: [PATCH] [2/2] change QP state to SQE Message-ID: <506C3D7B14CDD411A52C00025558DED6067481F8@mtlex01.yok.mtl.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, November 12, 2004 7:42 PM > > > I thought about this a little, and it seems that having the > CQ poll operation update the QP state is not the right > solution. It seems it would be better to add support for the > "Current QP state" modifier for the modify QP operation and > expect the consumer to use that to indicate that the QP is in > SQE state. > Actually I recall adding "current QP state" as an input modifier to the modify QP verb as part of the IB 1.1 errata (if I remember correctly). The main intention is to avoid the ambiguity when a consumer moves a QP into RTS state but can't tell if the QP was in SQError/Error or SQDrain. According to the spec, current QP state should only be valid when moving QP into RTS state. Hope that helps. -Dror From roland at topspin.com Sun Nov 14 21:25:05 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 14 Nov 2004 21:25:05 -0800 Subject: [openib-general] MAD handling Message-ID: <52d5yfptu6.fsf@topspin.com> A few questions about MAD handling: - What is supposed to happen to MADs that are received and are considered "solicited" because they have a method like GetResp, but which don't match any outstanding sends? Right now it looks as if they will be silently dropped in find_mad_agent(). Unfortunately this doesn't work very well with the current user_mad.c stuff -- I post all sends with a timeout of 0 and expect userspace to register an agent to get responses. I could have user_mad.c use timeouts, but then we need to come up with a way for the timeouts to be passed up to userspace. I'd sort of prefer to let userspace handle its own timeouts, although I could be persuaded otherwise. - It looks as if the case of response DR SMPs going to the SM is not handled in smi.c. smi_check_forward_dr_smp() doesn't handle the case of hop_ptr == 0, and smi_handle_dr_smp_send() just says /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ and returns 0, which will lead to the packet being dropped. How should this be fixed? - Also, if I'm reading the code correctly, it seems that in handle_outgoing_smp, mad_priv->mad will be dispatched even if no response was generated by the call to process_mad (ie we might pass garbage to the receive handler). Thanks, Roland From halr at voltaire.com Mon Nov 15 05:42:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 08:42:30 -0500 Subject: [openib-general] MAD handling In-Reply-To: <52d5yfptu6.fsf@topspin.com> References: <52d5yfptu6.fsf@topspin.com> Message-ID: <1100526150.3369.2119.camel@localhost.localdomain> On Mon, 2004-11-15 at 00:25, Roland Dreier wrote: > A few questions about MAD handling: > > - What is supposed to happen to MADs that are received and are > considered "solicited" because they have a method like GetResp, but > which don't match any outstanding sends? Right now it looks as if > they will be silently dropped in find_mad_agent(). This issue was brought up on the list last week in a thread entitled "Solicited response with no matching send request". There seemed to be no pressing need for this at the time.
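To make the drop concrete, the logic under discussion amounts to something like the following (a hypothetical sketch, not the actual find_mad_agent() code; the structure and field names are illustrative only): a response-method MAD is matched by transaction ID against the sends still awaiting a response, and an unmatched response never reaches any recv_handler.

	/* Hypothetical sketch of the matching step: illustrative
	   names, not the real mad.c internals. */
	static struct mad_send *match_response(struct mad_agent *agent,
					       struct ib_mad *mad)
	{
		struct mad_send *send;

		/* walk the sends still waiting for a response */
		list_for_each_entry(send, &agent->wait_list, list)
			if (send->tid == mad->mad_hdr.tid)
				return send;	/* solicited: completes this send */

		return NULL;	/* no match: the response is silently dropped */
	}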
> Unfortunately this doesn't work very well with the current user_mad.c stuff -- I > post all sends with a timeout of 0 and expect userspace to register > an agent to get responses. I can work on a patch for this. One issue raised with this was not providing an unmatched response if the client cancelled the send. This means that the cancellations need to be kept around (at least for some time period). > I could have user_mad.c use timeouts, but then we need to come up > with a way for the timeouts to be passed up to userspace. I'd sort > of prefer to let userspace handle its own timeouts, although I could > be persuaded otherwise. Seems to me like the SM would/could/should be using solicited sends with timeouts. Maybe that's not the way it would be done today; it may just be a port of what is already there. > - It looks as if the case of response DR SMPs going to the SM is not > handled in smi.c. smi_check_forward_dr_smp() doesn't handle the > case of hop_ptr == 0, and smi_handle_dr_smp_send() just says > > /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ > > and returns 0, which will lead to the packet being dropped. How > should this be fixed? I will be working on SMI today/tomorrow to hopefully fix the remaining cases. > - Also, if I'm reading the code correctly, it seems that in > handle_outgoing_smp, mad_priv->mad will be dispatched even if no > response was generated by the call to process_mad (ie we might pass > garbage to the receive handler). Are you referring to the case where SUCCESS is set without CONSUMED or REPLY (after calling process_mad for a local MAD)? Is this the trap repress case (from the SM to the local SMA)? -- Hal From roland at topspin.com Mon Nov 15 07:20:37 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 07:20:37 -0800 Subject: [openib-general] MAD handling In-Reply-To: <1100526150.3369.2119.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 15 Nov 2004 08:42:30 -0500") References: <52d5yfptu6.fsf@topspin.com> <1100526150.3369.2119.camel@localhost.localdomain> Message-ID: <527jonp29m.fsf@topspin.com> Hal> Seems to me like the SM would/could/should be using solicited Hal> sends with timeouts. Maybe that's not the way it would be done Hal> today; it may just be a port of what is already there. I guess I'll extend user_mad.c to handle timeouts then. Roland> - Also, if I'm reading the code correctly, it seems that Roland> in handle_outgoing_smp, mad_priv->mad will be dispatched Roland> even if no response was generated by the call to Roland> process_mad (ie we might pass garbage to the receive Roland> handler). Hal> Are you referring to the case where SUCCESS is set without Hal> CONSUMED or REPLY (after calling process_mad for a local MAD)? Hal> Is this the trap repress case (from the SM to the local SMA)? I'm just talking about the code starting below in handle_outgoing_smp(): /* See if response is solicited and there is a recv handler */ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); if (solicited_mad(&mad_priv->mad.mad) && mad_agent_priv->agent.recv_handler) { It seems we will start passing the MAD to the recv_handler without checking that process_mad() generated a reply (indeed without checking that we even called process_mad()).
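In other words, the dispatch presumably wants to be gated on the result flags process_mad() returns, roughly as in the sketch below (illustrative only, using the IB_MAD_RESULT_* flags; the setup of the synthetic ib_wc is elided):

	/* only hand the locally generated MAD to the receive handler
	 * if process_mad() was called and actually produced a reply */
	if ((ret & IB_MAD_RESULT_SUCCESS) &&
	    (ret & IB_MAD_RESULT_REPLY) &&
	    solicited_mad(&mad_priv->mad.mad) &&
	    mad_agent_priv->agent.recv_handler) {
		/* ... build the synthetic ib_wc and recv_wc here ... */
		mad_agent_priv->agent.recv_handler(mad_agent,
						   &mad_priv->header.recv_wc);
	} else
		/* no reply was generated: free the buffer instead of
		 * passing garbage up */
		kmem_cache_free(ib_mad_cache, mad_priv);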
- Roland From halr at voltaire.com Mon Nov 15 07:22:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 10:22:44 -0500 Subject: [openib-general] MAD handling In-Reply-To: <527jonp29m.fsf@topspin.com> References: <52d5yfptu6.fsf@topspin.com> <1100526150.3369.2119.camel@localhost.localdomain> <527jonp29m.fsf@topspin.com> Message-ID: <1100532164.3369.2137.camel@localhost.localdomain> On Mon, 2004-11-15 at 10:20, Roland Dreier wrote: > Roland> - Also, if I'm reading the code correctly, it seems that > Roland> in handle_outgoing_smp, mad_priv->mad will be dispatched > Roland> even if no response was generated by the call to > Roland> process_mad (ie we might pass garbage to the receive > Roland> handler). > > Hal> Are you referring to if SUCCESS is set without CONSUMED or > Hal> REPLY (after calling process_mad for a local MAD) ? Is this > Hal> the trap repress case (from the SM to the local SMA) ? > > I'm just talking about the code starting below in handle_outgoing_smp(): > > /* See if response is solicited and there is a recv handler */ > mad_agent_priv = container_of(mad_agent, > struct ib_mad_agent_private, > agent); > if (solicited_mad(&mad_priv->mad.mad) && > mad_agent_priv->agent.recv_handler) { > > It seems we will start passing the MAD to the recv_handler without > checking that process_mad() generated a reply (indeed without checking > that we even called process_mad()). I see what you mean. There are a number of cases which should skip this and just call the send handler. I'll issue a patch for this. Thanks. -- Hal From roland at topspin.com Mon Nov 15 08:40:09 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 08:40:09 -0800 Subject: [openib-general] MAD handling In-Reply-To: <1100532164.3369.2137.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 15 Nov 2004 10:22:44 -0500") References: <52d5yfptu6.fsf@topspin.com> <1100526150.3369.2119.camel@localhost.localdomain> <527jonp29m.fsf@topspin.com> <1100532164.3369.2137.camel@localhost.localdomain> Message-ID: <523bzboyl2.fsf@topspin.com> Oh yeah, one more slight glitch in the MAD API. It turns out that if a 0-hop DR SMP is passed to ib_post_send_mad(), the client's recv_handler will be called back directly from the same context. This means that the client has to be very careful to avoid deadlocking by taking the same lock in both the send posting code and the receive handling code. I fixed up the locking in user_mad.c to handle this but we may want to think about changing the MAD code to avoid this case. - R. From roland at topspin.com Mon Nov 15 08:52:15 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 08:52:15 -0800 Subject: [openib-general] Upstream submission Message-ID: <52y8h3njgg.fsf@topspin.com> Just to focus our minds, I would like to propose that we aim to post a first version of InfiniBand patches for review to linux-kernel next Monday, November 22. The plan would be to produce a series of patches that adds the code in our gen2/trunk: the IB core, mad layer, mthca, IPoIB and user MAD modules. I believe the code we have now is good enough to be reviewed, and I don't think it's going to get much better without input from the wider Linux community. I still need to update IPoIB driver to remove the use of /proc (more on this later) and add timeout handling to user_mad.c. This work should be finished today or tomorrow. Then I'll work on some scripts to take our svn tree and turn it into a series of patches for posting to lkml. 
I'll post a preliminary patch series just to openib-general by Friday morning, and if everything looks good I'll post the same series to lkml (cc'ed to openib-general so that we get replies as well). Unfortunately we missed the 2.6.10 release train (in yesterday's announcement of 2.6.10-rc2, Linus said: "Ok, the -rc2 changes are almost as big as the -rc1 changes, and we should now calm down, so I do not want to see anything but bug-fixes until 2.6.10 is released"). Still, I think by starting code review as soon as possible, we maximize our chances at getting merged as soon as 2.6.11 opens up. Comments? Objections? Thanks, Roland From roland at topspin.com Mon Nov 15 09:09:12 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 09:09:12 -0800 Subject: [openib-general] [PATCH] umad: pass timeouts to userspace Message-ID: <52u0rrnio7.fsf@topspin.com> OK, this adds a status and a timeout_ms field to struct ib_user_mad and passes timeouts up to userspace. Seem OK? - R. Index: infiniband/include/ib_user_mad.h =================================================================== --- infiniband/include/ib_user_mad.h (revision 1223) +++ infiniband/include/ib_user_mad.h (working copy) @@ -37,6 +37,10 @@ * ib_user_mad - MAD packet * @data - Contents of MAD * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) * @qpn - Remote QP number received from/to be sent to * @qkey - Remote Q_Key to be sent with (unset on receive) * @lid - Remote lid received from/to be sent to @@ -54,6 +58,8 @@ struct ib_user_mad { __u8 data[256]; __u32 id; + __u32 status; + __u32 timeout_ms; __u32 qpn; __u32 qkey; __u16 lid; Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1231) +++ infiniband/core/user_mad.c (working copy) @@ -84,17 +84,50 @@ static void ib_umad_add_one(struct ib_device *device); static void ib_umad_remove_one(struct ib_device *device); +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + static void send_handler(struct ib_mad_agent *agent, - struct ib_mad_send_wc *mad_send_wc) + struct ib_mad_send_wc *send_wc) { + struct ib_umad_file *file = agent->context; struct ib_umad_packet *packet = - (void *) (unsigned long) mad_send_wc->wr_id; + (void *) (unsigned long) send_wc->wr_id; pci_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, PCI_DMA_TODEVICE); ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + kfree(packet); } @@ -114,6 +147,7 @@ memset(packet, 0, sizeof *packet); memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + packet->mad.status = 0; packet->mad.qpn =
cpu_to_be32(mad_recv_wc->wc->src_qp); packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); packet->mad.sl = mad_recv_wc->wc->sl; @@ -128,23 +162,9 @@ packet->mad.flow_label = 0; } - down_read(&file->agent_mutex); - for (packet->mad.id = 0; - packet->mad.id < IB_UMAD_MAX_AGENTS; - packet->mad.id++) - if (agent == file->agent[packet->mad.id]) { - spin_lock_irq(&file->recv_lock); - list_add_tail(&packet->list, &file->recv_list); - spin_unlock_irq(&file->recv_lock); - wake_up_interruptible(&file->recv_wait); - goto agent; - } + if (queue_packet(file, agent, packet)) + kfree(packet); - kfree(packet); - -agent: - up_read(&file->agent_mutex); - out: ib_free_recv_mad(mad_recv_wc); } @@ -259,6 +279,7 @@ wr.wr.ud.ah = packet->ah; wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; wr.wr_id = (unsigned long) packet; Index: docs/user_mad.txt =================================================================== --- docs/user_mad.txt (revision 1223) +++ docs/user_mad.txt (working copy) @@ -39,6 +39,10 @@ fields will be filled in with information on the received MAD. For example, the remote LID will be in mad.lid. + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + poll()/select() may be used to wait until a MAD can be read. Sending MADs From robert.j.woodruff at intel.com Mon Nov 15 09:10:13 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Nov 2004 09:10:13 -0800 Subject: [openib-general] Upstream submission Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C7E39D@orsmsx408> >Comments? Objections? >Thanks, > Roland Getting code review as early as possible is probably a good idea. woody From tduffy at sun.com Mon Nov 15 09:23:18 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 15 Nov 2004 09:23:18 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <52y8h3njgg.fsf@topspin.com> References: <52y8h3njgg.fsf@topspin.com> Message-ID: <1100539398.13150.7.camel@duffman> On Mon, 2004-11-15 at 08:52 -0800, Roland Dreier wrote: > The plan would be to produce a series of patches > that adds the code in our gen2/trunk: the IB core, mad layer, mthca, > IPoIB and user MAD modules. Is there a reason to break up into patches code in drivers/infiniband? There seem to already be 4 patches outside of drivers/infiniband: linux-2.6.9-infiniband.diff linux-2.6.9-ioctl.diff linux-2.6.9-ipoib-ipv6.diff linux-2.6.9-ipoib-multicast.diff It is not like the parts of drivers/infiniband would be accepted and others not. That would not make much sense (who would want IPoIB without any layers below it to run on). -tduffy P.S. I think submitting the patches to lkml next Monday is a great idea. From halr at voltaire.com Mon Nov 15 09:55:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 12:55:19 -0500 Subject: [openib-general] IPoIB removal issue Message-ID: <1100541319.3369.2318.camel@localhost.localdomain> Hi, The ethernet on this machine is DHCP'd. Some network glitch (I think) followed by trying to remove the ipoib modules caused the following to be displayed in the console logs. Any ideas? Thanks.
-- Hal Nov 15 10:44:28 hpc-1 network: Shutting down interface eth1: succeeded Nov 15 10:44:28 hpc-1 network: Shutting down loopback interface: succeeded Nov 15 10:44:28 hpc-1 sysctl: net.ipv4.ip_forward = 0 Nov 15 10:44:28 hpc-1 sysctl: net.ipv4.conf.default.rp_filter = 1 Nov 15 10:44:28 hpc-1 sysctl: kernel.sysrq = 0 Nov 15 10:44:28 hpc-1 sysctl: kernel.core_uses_pid = 1 Nov 15 10:44:28 hpc-1 network: Setting network parameters: succeeded Nov 15 10:44:28 hpc-1 network: Bringing up loopback interface: succeeded Nov 15 10:44:28 hpc-1 ifup: Nov 15 10:44:28 hpc-1 ifup: Determining IP information for eth0... Nov 15 10:44:34 hpc-1 ifup: failed; no link present. Check cable? Nov 15 10:44:34 hpc-1 network: Bringing up interface eth0: failed Nov 15 10:44:34 hpc-1 ifup: Nov 15 10:44:34 hpc-1 ifup: Determining IP information for eth1... Nov 15 10:44:34 hpc-1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex Nov 15 10:44:34 hpc-1 dhclient: ib1: unknown hardware address type 32 Nov 15 10:44:34 hpc-1 dhclient: sit0: unknown hardware address type 776 Nov 15 10:44:34 hpc-1 dhclient: ib0: unknown hardware address type 32 Nov 15 10:44:35 hpc-1 dhclient: ib1: unknown hardware address type 32 Nov 15 10:44:35 hpc-1 dhclient: sit0: unknown hardware address type 776 Nov 15 10:44:35 hpc-1 dhclient: ib0: unknown hardware address type 32 Nov 15 10:44:36 hpc-1 dhclient: DHCPREQUEST on eth1 to 255.255.255.255 port 67 Nov 15 10:44:36 hpc-1 dhclient: DHCPACK from 10.0.2.1 Nov 15 10:44:36 hpc-1 dhclient: bound to 10.0.2.4 -- renewal in 1463 seconds. Nov 15 10:44:36 hpc-1 ifup: done. Nov 15 10:44:36 hpc-1 network: Bringing up interface eth1: succeeded Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:601b:ffff:0:0:1:ff96:71 Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:601b:ffff:0:0:0:0:1 Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:401b:ffff:0:0:0:0:1 Nov 15 11:07:00 hpc-1 kernel: leaving MGID ff12:401b:ffff:0:0:0:ffff:ffff Message from syslogd at hpc-1 at Mon Nov 15 11:07:10 2004 ... hpc-1 kernel: unregister_netdevice: waiting for ib0 to become free. Usage count = 1 Nov 15 11:07:10 hpc-1 kernel: unregister_netdevice: waiting for ib0 to become free. Usage count = 1 Message from syslogd at hpc-1 at Mon Nov 15 11:07:50 2004 ... hpc-1 last message repeated 4 times Nov 15 11:07:50 hpc-1 last message repeated 4 times From roland at topspin.com Mon Nov 15 10:11:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 10:11:38 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <1100539398.13150.7.camel@duffman> (Tom Duffy's message of "Mon, 15 Nov 2004 09:23:18 -0800") References: <52y8h3njgg.fsf@topspin.com> <1100539398.13150.7.camel@duffman> Message-ID: <52hdnrnfs5.fsf@topspin.com> Tom> Is there a reason to break up into patches code in Tom> drivers/infiniband? I think so: ease of review. A single 15000 line patch is not going to be very readable. Breaking it up into multiple pieces makes the architecture a little clearer and also helps naturally organize the replies into multiple threads. - R. From roland at topspin.com Mon Nov 15 10:14:31 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 10:14:31 -0800 Subject: [openib-general] Re: IPoIB removal issue In-Reply-To: <1100541319.3369.2318.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 15 Nov 2004 12:55:19 -0500") References: <1100541319.3369.2318.camel@localhost.localdomain> Message-ID: <52d5yfnfnc.fsf@topspin.com> Hal> unregister_netdevice: waiting for ib0 to become free. 
Usage count = 1 Someone is still holding a reference to the ib0 device. I don't see anything in the IPoIB code that could be doing it, so it seems like someone outside the driver must be doing it. - R. From roland at topspin.com Mon Nov 15 10:16:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 10:16:53 -0800 Subject: [openib-general] Signed-off-by: lines In-Reply-To: <52y8h3njgg.fsf@topspin.com> (Roland Dreier's message of "Mon, 15 Nov 2004 08:52:15 -0800") References: <52y8h3njgg.fsf@topspin.com> Message-ID: <526547nfje.fsf@topspin.com> By the way, for our initial submission upstream, I am planning on submitting all the patches with my own Signed-off-by: Roland Dreier line, of course preserving any other Signed-off-by: lines that already exist. However, for the future, it would be a good idea to make sure that all patches come with a properly formatted Signed-off-by: line(s) and preserve all such lines in the svn commit messages. (Read Documentation/SubmittingPatches in the kernel tree for full details) Thanks, Roland From mshefty at ichips.intel.com Mon Nov 15 10:29:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 10:29:18 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100315294.3369.682.camel@localhost.localdomain> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> Message-ID: <20041115102918.29e7dcdb.mshefty@ichips.intel.com> On Fri, 12 Nov 2004 22:08:14 -0500 Hal Rosenstock wrote: > This patch looks like it includes the previous patch and due to this 2 > large hunks are rejected. Can you regenerate this ? Updated patch. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1232) +++ core/mad.c (working copy) @@ -90,8 +90,6 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port @@ -1397,13 +1395,21 @@ } else ib_mad_send_done_handler(port_priv, wc); } else { + struct ib_qp_attr *attr; + /* Transition QP to RTS and fail offending send */ - ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); - if (ret) - printk(KERN_ERR PFX "mad_error_handler - unable to " - "transition QP to RTS : %d\n", ret); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + ret = ib_modify_qp(qp_info->qp, attr, IB_QP_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } ib_mad_send_done_handler(port_priv, wc); - mark_sends_for_retry(qp_info); } } @@ -1699,151 +1705,45 @@ { int ret; struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_INIT; - /* - * PKey index for QP1 is irrelevant but - * one is needed for the Reset to Init transition. 
- */ - attr->pkey_index = 0; - /* QKey is 0 for QP0 */ - if (qp->qp_num == 0) - attr->qkey = 0; - else - attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Receive state - */ -static inline int ib_mad_change_qp_state_to_rtr(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_RTR; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Send state - */ -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - attr->qp_state = IB_QPS_RTS; - attr_mask = IB_QP_STATE; - if (cur_state == IB_QPS_RTR) { - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask |= IB_QP_SQ_PSN; - } - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Reset state - */ -static inline int ib_mad_change_qp_state_to_reset(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; + struct ib_qp *qp; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " - "ret = %d\n", ret); - return ret; -} - -/* - * Start the port - */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) -{ - int ret, i, ret2; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition. + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "INIT\n", i); + "INIT: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTR\n", i); + "RTR: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, - IB_QPS_RTR); + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTS\n", i); + "RTS: %d\n", i, ret); goto error; } } @@ -1851,30 +1751,28 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { printk(KERN_ERR PFX "Failed to request completion " - "notification\n"); + "notification: %d\n", ret); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive " - "requests\n"); + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); goto error; } } - return 0; + goto out; error: for (i = 0; i < IB_MAD_QPS_CORE; i++) { + attr->qp_state = IB_QPS_RESET; + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ret2 = ib_mad_change_qp_state_to_reset(port_priv-> - qp_info[i].qp); - if (ret2) { - printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " - "change QP%d state to RESET\n", i); - } } + +out: + kfree(attr); return ret; } @@ -1884,18 +1782,26 @@ static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) { int i, ret; + struct ib_qp_attr *attr; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset( - port_priv->qp_info[i].qp); - if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" - " %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, - i); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RESET; + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, + IB_QP_STATE); + if (ret) + printk(KERN_ERR PFX "ib_mad_port_stop: " + "Couldn't change %s port %d QP%d " + "state to RESET\n", + port_priv->device->name, + port_priv->port_num, i); } - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + kfree(attr); } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); } static void qp_event_handler(struct ib_event *event, void *qp_context) From dledford at redhat.com Mon Nov 15 10:52:26 2004 From: dledford at redhat.com (Doug Ledford) Date: Mon, 15 Nov 2004 13:52:26 -0500 Subject: [openib-general] Upstream submission In-Reply-To: <52y8h3njgg.fsf@topspin.com> References: <52y8h3njgg.fsf@topspin.com> Message-ID: <1100544746.3712.26.camel@compaq-rhel4.xsintricity.com> On Mon, 2004-11-15 at 08:52 -0800, Roland Dreier wrote: > Just to focus our minds, I would like to propose that we aim to post a > first version of InfiniBand patches for review to linux-kernel next > Monday, November 22. Boo! ;-) I'll echo the sentiment that this is a good idea. While I'm piping up I'll go ahead and introduce myself. I'm a kernel engineer for Red Hat (been here about 6 years). I've been assigned the task of helping aid integration of the OpenIB work into our products. 
Obviously, upstream inclusion makes my task easier, so that's certainly welcome. I've also been assigned the task of assisting with ongoing development efforts. It'll be a little bit before I'm up to speed on things and able to effectively contribute (well, that and I'm going to have to line up some test hardware). I started out by subscribing to this list and lurking in the background. Been here about 3 weeks now. Next I'm planning on doing what's necessary to get access to the current IB specs and downloading the current code base and starting to familiarize myself with the spec and the current state of the code base. By the time I've made myself familiar with things I will have hopefully worked out the hardware issue and be able to get to work. Suggestions for items I can read, web sites I should visit in order to help get me up to speed, etc. welcomed. I'm sure you'll hear more from me in the future ;-) -- Doug Ledford Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From roland at topspin.com Mon Nov 15 11:15:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 11:15:41 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <1100544746.3712.26.camel@compaq-rhel4.xsintricity.com> (Doug Ledford's message of "Mon, 15 Nov 2004 13:52:26 -0500") References: <52y8h3njgg.fsf@topspin.com> <1100544746.3712.26.camel@compaq-rhel4.xsintricity.com> Message-ID: <52wtwmncte.fsf@topspin.com> Doug> Suggestions for items I can read, web sites I should visit Doug> in order to help get me up to speed, etc. welcomed. Doug, First off, welcome! Unfortunately there's not much to read about InfiniBand beyond the current IB spec. However, I think chapter 3 is actually quite a nice introduction. As far as our current codebase goes, documentation there is pretty sparse. However, the latest tree (svn at https://openib.org/svn/gen2/trunk/src/linux-kernel) should be reasonably understandable, since we've chopped the code down to a minimum. In any case please ask if there's anything that needs clarification. (By the way, I think we're working with Brian Stevens to get your hardware situation sorted out) - Roland From halr at voltaire.com Mon Nov 15 11:50:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 14:50:06 -0500 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <20041115102918.29e7dcdb.mshefty@ichips.intel.com> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> Message-ID: <1100548206.2767.1.camel@hpc-1> On Mon, 2004-11-15 at 13:29, Sean Hefty wrote: > On Fri, 12 Nov 2004 22:08:14 -0500 > Hal Rosenstock wrote: > > This patch looks like it includes the previous patch and due to this 2 > > large hunks are rejected. Can you regenerate this ? > > Updated patch. Patch now applies but I get the following compile errors: drivers/infiniband/core/mad.c: In function `ib_mad_change_qp_state_to_init': drivers/infiniband/core/mad.c:1708: warning: declaration of `qp' shadows a parameter drivers/infiniband/core/mad.c:1716: `i' undeclared (first use in this function) drivers/infiniband/core/mad.c:1716: (Each undeclared identifier is reported only once drivers/infiniband/core/mad.c:1716: for each function it appears in.) 
drivers/infiniband/core/mad.c:1717: `port_priv' undeclared (first use in this function) drivers/infiniband/core/mad.c: In function `ib_mad_port_open': drivers/infiniband/core/mad.c:1944: warning: implicit declaration of function `ib_mad_port_start' From mshefty at ichips.intel.com Mon Nov 15 11:48:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 11:48:23 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100548206.2767.1.camel@hpc-1> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> <1100548206.2767.1.camel@hpc-1> Message-ID: <41990807.8030604@ichips.intel.com> Hal Rosenstock wrote: > Patch now applies but I get the following compile errors: > > drivers/infiniband/core/mad.c: In function > `ib_mad_change_qp_state_to_init': > drivers/infiniband/core/mad.c:1708: warning: declaration of `qp' shadows > a parameter > drivers/infiniband/core/mad.c:1716: `i' undeclared (first use in this > function) > drivers/infiniband/core/mad.c:1716: (Each undeclared identifier is > reported only once > drivers/infiniband/core/mad.c:1716: for each function it appears in.) > drivers/infiniband/core/mad.c:1717: `port_priv' undeclared (first use in > this function) > drivers/infiniband/core/mad.c: In function `ib_mad_port_open': > drivers/infiniband/core/mad.c:1944: warning: implicit declaration of > function `ib_mad_port_start' Something didn't merge right here between the two patches. I don't think this function name is even correct. Let me recheck this. - Sean From halr at voltaire.com Mon Nov 15 12:06:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 15:06:25 -0500 Subject: [openib-general] [PATCH] mad: In handle_outgoing_smp, only match response if generated Message-ID: <1100549185.2767.11.camel@hpc-1> mad: In handle_outgoing_smp, only match response if generated (based on comment from Roland) Index: mad.c =================================================================== --- mad.c (revision 1230) +++ mad.c (working copy) @@ -394,6 +394,10 @@ goto error1; } + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + if (mad_agent->device->process_mad) { ret = mad_agent->device->process_mad( mad_agent->device, @@ -418,46 +422,50 @@ mad_priv); goto error1; } - } - } - } - /* See if response is solicited and there is a recv handler */ - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; - /* - * Defined behavior is to complete response - * before request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = 0; /* IB_QPT_SMI ? */ - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = + /* + * Defined behavior is to + * complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler( - mad_agent, - &mad_priv->header.recv_wc); - } else - kmem_cache_free(ib_mad_cache, mad_priv); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = + &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler( + mad_agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } if (mad_agent_priv->agent.send_handler) { /* Now, complete send */ From paul.baxter at dsl.pipex.com Mon Nov 15 12:04:03 2004 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Mon, 15 Nov 2004 20:04:03 -0000 Subject: [openib-general] Upstream submission References: <52y8h3njgg.fsf@topspin.com> <1100539398.13150.7.camel@duffman> <52hdnrnfs5.fsf@topspin.com> Message-ID: <009601c4cb4e$41eb16f0$8000000a@blorp> While I am delighted that the lower layers are sufficiently stable to warrant being considered for code review/inclusion in the kernel, I am slightly surprised. Has the code been used in anger enough? There seem to be a lot of bugs still being discovered daily. Wouldn't having at least a preliminary set of user capabilities help assessment of the low level code and allow a wider set of people to evaluate it? Are there sufficient test tools and documentation to allow an IB novice (kernel expert) to evaluate the offering? Perhaps over the next month Doug could be a 'dry run guinea pig' for kernel inclusion and highlight the documentation and coding areas of difficulty prior to submission for a wider audience. I am concerned that an overly changeable or buggy submission may do more harm than good. Good luck with it though as I'm really looking forward to the fruits of this development. Just an opinion Paul ----- Original Message ----- From: "Roland Dreier" To: "Tom Duffy" Cc: <> Sent: Monday, November 15, 2004 6:11 PM Subject: Re: [openib-general] Upstream submission > Tom> Is there a reason to break up into patches code in > Tom> drivers/infiniband? > > I think so: ease of review. A single 15000 line patch is not going to > be very readable. Breaking it up into multiple pieces makes the > architecture a little clearer and also helps naturally organize the > replies into multiple threads. > > - R.
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Mon Nov 15 12:04:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 12:04:11 -0800 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <1100548206.2767.1.camel@hpc-1> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> <1100548206.2767.1.camel@hpc-1> Message-ID: <20041115120411.7ae02766.mshefty@ichips.intel.com> On Mon, 15 Nov 2004 14:50:06 -0500 Hal Rosenstock wrote: > On Mon, 2004-11-15 at 13:29, Sean Hefty wrote: > > On Fri, 12 Nov 2004 22:08:14 -0500 > > Hal Rosenstock wrote: > > > This patch looks like it includes the previous patch and due to this 2 > > > large hunks are rejected. Can you regenerate this ? > > > > Updated patch. This should fix the merge/compilation issues. Also, I re-examined the initial patch, and I don't see why it would have failed. Oh well... - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1232) +++ core/mad.c (working copy) @@ -90,8 +90,6 @@ struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static int solicited_mad(struct ib_mad *mad); -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state); /* * Returns a ib_mad_port_private structure or NULL for a device/port @@ -1397,13 +1395,23 @@ } else ib_mad_send_done_handler(port_priv, wc); } else { + struct ib_qp_attr *attr; + /* Transition QP to RTS and fail offending send */ - ret = ib_mad_change_qp_state_to_rts(qp_info->qp, IB_QPS_SQE); - if (ret) - printk(KERN_ERR PFX "mad_error_handler - unable to " - "transition QP to RTS : %d\n", ret); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } ib_mad_send_done_handler(port_priv, wc); - mark_sends_for_retry(qp_info); } } @@ -1693,157 +1701,51 @@ } /* - * Modify QP into Init state - */ -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_INIT; - /* - * PKey index for QP1 is irrelevant but - * one is needed for the Reset to Init transition. - */ - attr->pkey_index = 0; - /* QKey is 0 for QP0 */ - if (qp->qp_num == 0) - attr->qkey = 0; - else - attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_init " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Receive state + * Start the port. 
*/ -static inline int ib_mad_change_qp_state_to_rtr(struct ib_qp *qp) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - - attr->qp_state = IB_QPS_RTR; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rtr " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Ready-To-Send state - */ -static int ib_mad_change_qp_state_to_rts(struct ib_qp *qp, - enum ib_qp_state cur_state) -{ - int ret; - struct ib_qp_attr *attr; - int attr_mask; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); - return -ENOMEM; - } - attr->qp_state = IB_QPS_RTS; - attr_mask = IB_QP_STATE; - if (cur_state == IB_QPS_RTR) { - attr->sq_psn = IB_MAD_SEND_Q_PSN; - attr_mask |= IB_QP_SQ_PSN; - } - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_rts " - "ret = %d\n", ret); - return ret; -} - -/* - * Modify QP into Reset state - */ -static inline int ib_mad_change_qp_state_to_reset(struct ib_qp *qp) +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) { - int ret; + int ret, i; struct ib_qp_attr *attr; - int attr_mask; + struct ib_qp *qp; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) { - printk(KERN_ERR PFX "Couldn't allocate memory for " - "ib_qp_attr\n"); + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); return -ENOMEM; } - attr->qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - - ret = ib_modify_qp(qp, attr, attr_mask); - kfree(attr); - - if (ret) - printk(KERN_WARNING PFX "ib_mad_change_qp_state_to_reset " - "ret = %d\n", ret); - return ret; -} - -/* - * Start the port - */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) -{ - int ret, i, ret2; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp_info[i].qp); + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition. + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "INIT\n", i); + "INIT: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp_info[i].qp); + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTR\n", i); + "RTR: %d\n", i, ret); goto error; } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp_info[i].qp, - IB_QPS_RTR); + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " - "RTS\n", i); + "RTS: %d\n", i, ret); goto error; } } @@ -1851,30 +1753,28 @@ ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); if (ret) { printk(KERN_ERR PFX "Failed to request completion " - "notification\n"); + "notification: %d\n", ret); goto error; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { - printk(KERN_ERR PFX "Couldn't post receive " - "requests\n"); + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); goto error; } } - return 0; + goto out; error: for (i = 0; i < IB_MAD_QPS_CORE; i++) { + attr->qp_state = IB_QPS_RESET; + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - ret2 = ib_mad_change_qp_state_to_reset(port_priv-> - qp_info[i].qp); - if (ret2) { - printk(KERN_ERR PFX "ib_mad_port_start: Couldn't " - "change QP%d state to RESET\n", i); - } } + +out: + kfree(attr); return ret; } @@ -1884,18 +1784,26 @@ static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) { int i, ret; + struct ib_qp_attr *attr; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset( - port_priv->qp_info[i].qp); - if (ret) { - printk(KERN_ERR PFX "ib_mad_port_stop: Couldn't change" - " %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, - i); + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RESET; + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, + IB_QP_STATE); + if (ret) + printk(KERN_ERR PFX "ib_mad_port_stop: " + "Couldn't change %s port %d QP%d " + "state to RESET\n", + port_priv->device->name, + port_priv->port_num, i); } - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); + kfree(attr); } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) + ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); } static void qp_event_handler(struct ib_event *event, void *qp_context) From roland at topspin.com Mon Nov 15 12:23:26 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 12:23:26 -0800 Subject: [openib-general] Upstream submission In-Reply-To: <009601c4cb4e$41eb16f0$8000000a@blorp> (Paul Baxter's message of "Mon, 15 Nov 2004 20:04:03 -0000") References: <52y8h3njgg.fsf@topspin.com> <1100539398.13150.7.camel@duffman> <52hdnrnfs5.fsf@topspin.com> <009601c4cb4e$41eb16f0$8000000a@blorp> Message-ID: <52oehyn9oh.fsf@topspin.com> Paul> Has the code been used in anger enough? Paul> There seem to be a lot of bugs still being discovered daily. I think in most scenarios IPoIB is quite stable. I've run many gigabytes of traffic without trouble. 
There may still be corner cases with module unloading and the like, but I think the best way to fix those is to get enough testers so that we can start to see a pattern to the problems. Paul> Wouldn't having at least a preliminary set of user Paul> capabilities help assessment of the low level code and allow Paul> a wider set of people to evaluate it. I think we're better served in starting with as small and digestible a chunk of code as possible and building on that. Getting a foot in the door and all that... Paul> Perhaps over the next month Doug could be a 'dry run guinea Paul> pig' for kernel inclusion and highlight the documentation Paul> and coding areas of difficulty prior to submission for a Paul> wider audience. I think we're really at the point where we're ready for a full lkml code review. Certainly I think we're at a level of stability and functionality that is appropriate for inclusion in an -mm kernel if not Linus's tree. The code doesn't need to be perfect before it gets in the kernel -- just good enough to be usable and benefit from the increased test coverage. I do plan on marking the InfiniBand Kconfig options as EXPERIMENTAL, so that should help set expectations. Paul> I am concerned that an overly changeable or buggy submission Paul> may do more harm than good. In the past there certainly have been submissions to lkml that were far too early. However our current tree is definitely not an embarrassment: there aren't any gross violations of coding standards, we use modern interfaces like sysfs correctly, and so on. Thanks, Roland From halr at voltaire.com Mon Nov 15 12:32:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 15:32:23 -0500 Subject: [openib-general] Re: [PATCH] collapse MAD function calls In-Reply-To: <20041115120411.7ae02766.mshefty@ichips.intel.com> References: <20041112164509.561e90de.mshefty@ichips.intel.com> <1100315294.3369.682.camel@localhost.localdomain> <20041115102918.29e7dcdb.mshefty@ichips.intel.com> <1100548206.2767.1.camel@hpc-1> <20041115120411.7ae02766.mshefty@ichips.intel.com> Message-ID: <1100550743.2767.32.camel@hpc-1> On Mon, 2004-11-15 at 15:04, Sean Hefty wrote: > On Mon, 15 Nov 2004 14:50:06 -0500 > Hal Rosenstock wrote: > > > On Mon, 2004-11-15 at 13:29, Sean Hefty wrote: > > > On Fri, 12 Nov 2004 22:08:14 -0500 > > > Hal Rosenstock wrote: > > > > This patch looks like it includes the previous patch and due to this 2 > > > > large hunks are rejected. Can you regenerate this ? > > > > > > Updated patch. > > This should fix the merge/compilation issues. Thanks. Applied. > Also, I re-examined the initial patch, > and I don't see why it would have failed. Oh well... I can see some differences (other than line numbers): 28,30c28 < + attr->cur_qp_state = IB_QPS_SQE; < + ret = ib_modify_qp(qp_info->qp, attr, < + IB_QP_STATE | IB_QP_CUR_STATE); --- > + ret = ib_modify_qp(qp_info->qp, attr, IB_QP_STATE); 43,52c41,44 < @@ -1693,157 +1701,51 @@ < } < < /* < - * Modify QP into Init state < - */ < -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) < -{ < - int ret; < - struct ib_qp_attr *attr; --- > @@ -1699,151 +1705,45 @@ > { > int ret; > struct ib_qp_attr *attr; 86,87c78 < + * Start the port. 
< */ --- > - */ 148,149c139 < +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) < { --- > -{ 151,152c141 < + int ret, i; < struct ib_qp_attr *attr; --- > - struct ib_qp_attr *attr; -- Hal From robert.j.woodruff at intel.com Mon Nov 15 13:30:51 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Nov 2004 13:30:51 -0800 Subject: [openib-general] Upstream submission Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C7E9C3@orsmsx408> Doug Ledford wrote, >Boo! ;-) Ditto what Roland said, Welcome. woody From Tom.Duffy at Sun.COM Mon Nov 15 13:31:37 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Mon, 15 Nov 2004 13:31:37 -0800 Subject: [openib-general] [PATCH] fix sparse warnings in mthca Message-ID: <1100554297.13150.23.camel@duffman> Was getting warnings like: "warning: Using plain integer as NULL pointer" when sparse checking on x86_64. Signed-off-by: Tom Duffy Index: drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- drivers/infiniband/hw/mthca/mthca_doorbell.h (revision 1234) +++ drivers/infiniband/hw/mthca/mthca_doorbell.h (working copy) @@ -40,7 +40,7 @@ #define MTHCA_DECLARE_DOORBELL_LOCK(name) #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) -#define MTHCA_GET_DOORBELL_LOCK(ptr) (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) @@ -53,7 +53,7 @@ #define MTHCA_DECLARE_DOORBELL_LOCK(name) #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) -#define MTHCA_GET_DOORBELL_LOCK(ptr) (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) static inline unsigned long mthca_get_fpu(void) { -- Tom Duffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Nov 15 13:36:29 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 13:36:29 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100552937.2767.69.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> Message-ID: <4199215D.3070401@ichips.intel.com> Hal Rosenstock wrote: > After Roland's query this AM, I am looking at this some more: > > On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > >>The second case where I can see this happening is if the client canceled >>the send, and I'm not sure that we'd want to give the client an >>unmatched response in this case. > > > So do we just keep the cancel around for some time period to make sure > this doesn't occur ? If so, should cancel also have its own timeout or > should some arbitrary timeout be used to handle this case ? My personal take would be to avoid adding that complexity. E.g. a client sends a MAD with TID 5, cancels 5, sends 5, cancels 5, sends 5. A response is now received. What should the MAD layer do? I don't see issues with silently dropping any MAD that we're not ready to receive. For unsolicited MADs, I don't see a reasonable alternative. For solicited (response) MADs, I have a hard time seeing why a client would ever want an unmatched MAD, unless they're trying to duplicate MAD layer functionality higher in the stack. 
For user-mode, this may make sense, but I'm not convinced that duplicating the request-response functionality in user-mode is the best option (versus moving all of RMPP to user-mode). For the sourceforge stack, we handled this by defining "raw" MAD services that did nothing other than send/receive MADs. Clients using a raw service were responsible for performing RMPP, request/response matching, and handling timeouts. This worked, but the MAD layer still needed to route received MADs to the correct client. - Sean From roland at topspin.com Mon Nov 15 13:44:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 13:44:54 -0800 Subject: [openib-general] Re: [PATCH] fix sparse warnings in mthca In-Reply-To: <1100554297.13150.23.camel@duffman> (Tom Duffy's message of "Mon, 15 Nov 2004 13:31:37 -0800") References: <1100554297.13150.23.camel@duffman> Message-ID: <52y8h2lrc9.fsf@topspin.com> Thanks, applied. I'm cross-compiling for lots of archs but I only run sparse on i386. It's always something... ;) (Thanks for the Signed-off-by: line too) - R. From robert.j.woodruff at intel.com Mon Nov 15 13:48:05 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Nov 2004 13:48:05 -0800 Subject: [openib-general] Upstream submission Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002C7EA23@orsmsx408> Paul Baxter wrote, >While I am delighted that the lower layers are sufficiently stable to >warrant being considered for code review/inclusion in the kernel, I am >slightly surprised. >Has the code been used in anger enough? I think that Roland is suggesting we submit it for review now, not inclusion. The team can then incorporate the comments from lkml before submitting it for inclusion. Given the past IBA projects where we developed a lot of code/capabilities and tested it fully before getting review by lkml only to have the code flamed to death when we did submit it, I think that sending in code early is better than later (IMO). >There seem to be a lot of bugs still being discovered daily. >Wouldn't having at least a preliminary set of user capabilities help >assessment of the low level code and allow a wider set of people to evaluate >it. I think the initial set of capabilities is the ability to run IPoIB. >Are there sufficient test tools and documentation to allow an IB novice >(kernel expert) to evaluate the offering. Initial test tools can be anything that runs on top of a network stack today. I do think that it is important to have good enough documentation for people to configure/run the stuff. I actually think that having it included in the kernel tree will make it easier for people to try it out, rather than having to check out the code from svn. >Perhaps over the next month Doug could be a 'dry run guinea pig' for kernel >inclusion and highlight the documentation and coding areas of difficulty >prior to submission for a wider audience. Good idea.
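For illustration, the ib_modify_qp() hunks at the top of this digest (and Sean's cleanup patch further down) drive the two MAD QPs through the short special-QP state ladder: INIT with a P_Key index and Q_Key, RTR with no address vector, then RTS with a starting send PSN. Below is a minimal sketch of that sequence; the function name is hypothetical, error handling is trimmed, and the real code allocates the attribute struct with kmalloc rather than on the stack:

	static int example_bring_up_mad_qp(struct ib_qp *qp)
	{
		struct ib_qp_attr attr;
		int ret;

		attr.qp_state   = IB_QPS_INIT;
		attr.pkey_index = 0;
		attr.qkey       = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
		ret = ib_modify_qp(qp, &attr,
				   IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY);
		if (ret)
			return ret;

		/* QP0/QP1 need no address vector or destination QP for RTR */
		attr.qp_state = IB_QPS_RTR;
		ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
		if (ret)
			return ret;

		attr.qp_state = IB_QPS_RTS;
		attr.sq_psn   = IB_MAD_SEND_Q_PSN;
		return ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_SQ_PSN);
	}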
From halr at voltaire.com Mon Nov 15 13:08:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 16:08:58 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <4192616C.7070905@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> Message-ID: <1100552937.2767.69.camel@hpc-1> After Roland's query this AM, I am looking at this some more: On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > The second case where I can see this happening is if the client canceled > the send, and I'm not sure that we'd want to give the client an > unmatched response in this case. So do we just keep the cancel around for some time period to make sure this doesn't occur ? If so, should cancel also have its own timeout or should some arbitrary timeout be used to handle this case ? -- Hal From halr at voltaire.com Mon Nov 15 13:56:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 16:56:06 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <4199215D.3070401@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> Message-ID: <1100555766.2767.102.camel@hpc-1> On Mon, 2004-11-15 at 16:36, Sean Hefty wrote: > Hal Rosenstock wrote: > > > After Roland's query this AM, I am looking at this some more: > > > > On Wed, 2004-11-10 at 13:43, Sean Hefty wrote: > > > >>The second case where I can see this happening is if the client canceled > >>the send, and I'm not sure that we'd want to give the client an > >>unmatched response in this case. > > > > > > So do we just keep the cancel around for some time period to make sure > > this doesn't occur ? If so, should cancel also have its own timeout or > > should some arbitrary timeout be used to handle this case ? > > My personal take would be to avoid adding that complexity. E.g. a > client sends a MAD with TID 5, cancels 5, sends 5, cancels 5, sends 5. > A response is now received. What should the MAD layer do? > I don't see issues with silently dropping any MAD that we're not ready > to receive. What does "ready to receive" mean in this context ? Does it mean there is no matching send if it is a solicited response ? > For unsolicited MADs, I don't see a reasonable alternative. Not sure what you mean by alternative for unsolicited MADs. For unsolicited MADs, there is only the version/class/method based routing. If there is no client, the receive is dropped. > For solicited (response) MADs, I have a hard time seeing why a client > would ever want an unmatched MAD, unless they're trying to duplicate MAD > layer functionality higher in the stack. Yes, I too would view this as a duplication of MAD layer services. > For user-mode, this may make sense, but I'm not convinced that duplicating the request-response > functionality in user-mode is the best option (versus moving all of RMPP > to user-mode). What functionality are you referring to being duplicated ? Request/response matching with timeouts ? Wouldn't moving RMPP to user mode be a duplication ? There are certain things in the kernel that might want to use RMPP. > For the sourceforge stack, we handled this by defining "raw" MAD > services that did nothing other than send/receive MADs. Clients using a > raw service were responsible for performing RMPP, request/response > matching, and handling timeouts. 
This worked, but the MAD layer still > needed to route received MADs to the correct client. Yes, but there are two types of routing: TID based and version/class/method based. -- Hal From halr at voltaire.com Mon Nov 15 13:17:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 16:17:39 -0500 Subject: [openib-general] OpenIB BuiltIn Support ? Message-ID: <1100553459.2767.80.camel@hpc-1> Hi Roland, Should IB build as either built-in or modules ? (I usually build everything as modules). If built-in should work, does everything IB need to be built in rather than as modules ? Just wondering what the expectations should be here. Thanks. -- Hal From mlleinin at hpcn.ca.sandia.gov Mon Nov 15 13:55:41 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 15 Nov 2004 13:55:41 -0800 Subject: [openib-general] Signed-off-by: lines In-Reply-To: <526547nfje.fsf@topspin.com> References: <52y8h3njgg.fsf@topspin.com> <526547nfje.fsf@topspin.com> Message-ID: <1100555741.14334.699.camel@trinity> On Mon, 2004-11-15 at 10:16 -0800, Roland Dreier wrote: > By the way, for our initial submission upstream, I am planning on > submitting all the patches with my own > > Signed-off-by: Roland Dreier > > line, of course preserving any other Signed-off-by: lines that already > exist. However, for the future, it would be a good idea to make sure > that all patches come with a properly formatted Signed-off-by: line(s) > and preserve all such lines in the svn commit messages. > > (Read Documentation/SubmittingPatches in the kernel tree for full details) > I added the "signed-off by" requirement to the OpenIB FAQ. We probably need to have an 'SVN acceptable use policy' that covers the licensing and "signed-off by" requirements. I'll put something together and put it up on openib.org for review. - Matt From roland at topspin.com Mon Nov 15 13:58:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 13:58:34 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100552937.2767.69.camel@hpc-1> (Hal Rosenstock's message of "Mon, 15 Nov 2004 16:08:58 -0500") References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> Message-ID: <52ekiulqph.fsf@topspin.com> Hal> So do we just keep the cancel around for some time period to Hal> make sure this doesn't occur ? If so, should cancel also have Hal> its own timeout or should some arbitrary timeout be used to Hal> handle this case ? I don't think we should worry about this. If a consumer sends two requests with the same TID close enough together that we can't tell which is which, that's the consumer's fault and if it breaks they should just get both pieces. - R. From roland at topspin.com Mon Nov 15 14:00:18 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 14:00:18 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <1100553459.2767.80.camel@hpc-1> (Hal Rosenstock's message of "Mon, 15 Nov 2004 16:17:39 -0500") References: <1100553459.2767.80.camel@hpc-1> Message-ID: <52actilqml.fsf@topspin.com> Hal> Should IB build as either built-in or modules ? (I usually Hal> build everything as modules). If built-in should work, does Hal> everything IB need to be built in rather than as modules ? I haven't actually tried it but I think any combination of 'y' and 'm' for config options that is allowed by the kernel config system should work. If it doesn't then it should be fairly easy to fix. - R. 
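A bit of background on why both 'y' and 'm' can work without source changes: the standard module entry points degrade gracefully when code is built in, since module_init() becomes a boot-time initcall and __exit code is simply discarded. A minimal sketch of the pattern (the names here are made up):

	#include <linux/module.h>
	#include <linux/init.h>

	static int __init example_ib_init(void)
	{
		/* runs at modprobe time when =m, during boot when =y */
		return 0;
	}

	static void __exit example_ib_exit(void)
	{
		/* discarded entirely in the built-in (=y) case */
	}

	module_init(example_ib_init);
	module_exit(example_ib_exit);
	MODULE_LICENSE("GPL");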
From peter at pantasys.com Mon Nov 15 14:11:54 2004 From: peter at pantasys.com (Peter Buckingham) Date: Mon, 15 Nov 2004 14:11:54 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <52actilqml.fsf@topspin.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> Message-ID: <419929AA.1010409@pantasys.com> Roland Dreier wrote: > Hal> Should IB build as either built-in or modules ? (I usually > Hal> build everything as modules). If built-in should work, does > Hal> everything IB need to be built in rather than as modules ? > > I haven't actually tried it but I think any combination of 'y' and 'm' > for config options that is allowed by the kernel config system should > work. If it doesn't then it should be fairly easy to fix. I have tried this with gen1 and things don't seem to play nice.. I've only tried it with mellanox's hca driver, does mthca work better when built-in? thanks, peter From mshefty at ichips.intel.com Mon Nov 15 14:15:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 14:15:50 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100555766.2767.102.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> Message-ID: <41992A96.9060000@ichips.intel.com> Hal Rosenstock wrote: >>My personal take would be to avoid adding that complexity. E.g. a >>client sends a MAD with TID 5, cancels 5, sends 5, cancels 5, sends 5. >>A response is now received. What should the MAD layer do? >>I don't see issues with silently dropping any MAD that we're not ready >>to receive. > > What does "ready to receive" mean in this context ? Does it mean there > is no matching send if it is a solicited response ? Meaning that we have a client that has requested to receive a MAD, by either asking for an unsolicited MADs via ib_register_mad_agent, or by asking for a solicited MAD by sending a request via ib_post_send_mad. >> For unsolicited MADs, I don't see a reasonable alternative. > > Not sure what you mean by alternative for unsolicited MADs. For > unsolicited MADs, there is only the version/class/method based routing. > If there is no client, the receive is dropped. Correct - the receive is dropped. There really isn't an alternative. >>For user-mode, this may make sense, but I'm not convinced that duplicating the request-response >>functionality in user-mode is the best option (versus moving all of RMPP >>to user-mode). > > > What functionality are you referring to being duplicated ? > Request/response matching with timeouts ? Wouldn't moving RMPP to user > mode be a duplication ? There are certain things in the kernel that > might want to use RMPP. I'm not suggesting relocating RMPP or request/response matching to user-mode, but I would consider duplicating those services in user-mode for user-mode clients if there was a strong enough reason. > Yes, but there are two types of routing: TID based and > version/class/method based. Correct, and we're talking mainly about TID based routing in this case. I guess my view is that I don't think that the code should trust anything that comes off the wire. If we receive a MAD that results in TID based routing that doesn't match with an existing request, then dropping it seems like the safest solution. If we want to route the MAD to the corresponding agent, however, we can do that. 
But doing this only seems useful if a client is duplicating functionality, which only makes sense to me for user-mode clients. - Sean From halr at voltaire.com Mon Nov 15 14:34:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 17:34:30 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <41992A96.9060000@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> <41992A96.9060000@ichips.intel.com> Message-ID: <1100558070.2767.119.camel@hpc-1> On Mon, 2004-11-15 at 17:15, Sean Hefty wrote: > If we want to route the MAD to the corresponding agent, however, we can > do that. But doing this only seems useful if a client is duplicating > functionality, which only makes sense to me for user-mode clients. If we want to limit this to user mode clients only, we would need an extra parameter on register to indicate whether the client was kernel or user mode. This clearly wouldn't be very trustworthy. Is there a better way ? -- Hal From tduffy at sun.com Mon Nov 15 14:33:07 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 15 Nov 2004 14:33:07 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <419929AA.1010409@pantasys.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> Message-ID: <1100557988.13150.28.camel@duffman> On Mon, 2004-11-15 at 14:11 -0800, Peter Buckingham wrote: > I have tried this with gen1 and things don't seem to play nice.. I've > only tried it with mellanox's hca driver, does mthca work better when > built-in? I just tried with the latest gen2 openib bits on 2.6.10-rc2, mthca and ipoib builtin and everything builds and boots fine (at least on x86_64). -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Mon Nov 15 14:35:37 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 14:35:37 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <419929AA.1010409@pantasys.com> (Peter Buckingham's message of "Mon, 15 Nov 2004 14:11:54 -0800") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> Message-ID: <526546lozq.fsf@topspin.com> Peter> I have tried this with gen1 and things don't seem to play Peter> nice.. I've only tried it with mellanox's hca driver, does Peter> mthca work better when built-in? Yes, I'm sure gen1 is completely broken, as is mellanox's driver. mthca should work since it uses the correct PCI driver API. I'll try building a kernel with IB built-in and report back... - R. 
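The "correct PCI driver API" referred to here is the struct pci_driver probe/remove model: the PCI core binds the driver to matching devices the same way whether the driver is modular or built in, whereas drivers that scan the bus by hand at load time tend to break in the built-in case. A rough sketch of the shape, where the names are illustrative and the device ID is just an example, not the actual mthca table:

	#include <linux/pci.h>
	#include <linux/init.h>

	static int __devinit example_probe(struct pci_dev *pdev,
					   const struct pci_device_id *id)
	{
		int err = pci_enable_device(pdev);
		if (err)
			return err;
		pci_set_master(pdev);
		/* map BARs, initialize the HCA, register with the IB core */
		return 0;
	}

	static void __devexit example_remove(struct pci_dev *pdev)
	{
		/* unregister and tear down the HCA, then: */
		pci_disable_device(pdev);
	}

	static struct pci_device_id example_ids[] = {
		{ PCI_DEVICE(0x15b3, 0x5a44) },	/* e.g. a Mellanox MT23108 */
		{ 0 }
	};

	static struct pci_driver example_driver = {
		.name     = "example_hca",
		.id_table = example_ids,
		.probe    = example_probe,
		.remove   = example_remove,
	};

The driver is then registered from its init function with pci_register_driver(&example_driver), which works identically in the modular and built-in configurations.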
From mshefty at ichips.intel.com Mon Nov 15 14:37:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 14:37:16 -0800 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <1100558070.2767.119.camel@hpc-1> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> <41992A96.9060000@ichips.intel.com> <1100558070.2767.119.camel@hpc-1> Message-ID: <41992F9C.80900@ichips.intel.com> Hal Rosenstock wrote: >>If we want to route the MAD to the corresponding agent, however, we can >>do that. But doing this only seems useful if a client is duplicating >>functionality, which only makes sense to me for user-mode clients. > > > If we want to limit this to user mode clients only, we would need an > extra parameter on register to indicate whether the client was kernel or > user mode. This clearly wouldn't be very trustworthy. Is there a better > way ? Since the registration would actually be done in the kernel, I think that we can trust it. It's just that before supporting this, I'd like to make sure that routing unmatched responses is really the right solution. I.e. Is this something that kernel mode clients would need? Does it make sense for clients to duplicate additional functionality, such as RMPP? Would a solution that duplicated RMPP functionality in user-mode be better than one that only allowed for managing timeouts? - Sean From halr at voltaire.com Mon Nov 15 14:55:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 15 Nov 2004 17:55:14 -0500 Subject: [openib-general] Solicited response with no matching send request In-Reply-To: <41992F9C.80900@ichips.intel.com> References: <1100111569.2836.61.camel@hpc-1> <4192616C.7070905@ichips.intel.com> <1100552937.2767.69.camel@hpc-1> <4199215D.3070401@ichips.intel.com> <1100555766.2767.102.camel@hpc-1> <41992A96.9060000@ichips.intel.com> <1100558070.2767.119.camel@hpc-1> <41992F9C.80900@ichips.intel.com> Message-ID: <1100559314.2767.124.camel@hpc-1> On Mon, 2004-11-15 at 17:37, Sean Hefty wrote: > It's just that before supporting this, I'd like > to make sure that routing unmatched responses is really the right solution. > > I.e. Is this something that kernel mode clients would need? I think you mean user mode clients. > Does it > make sense for clients to duplicate additional functionality, such as > RMPP? Would a solution that duplicated RMPP functionality in user-mode > be better than one that only allowed for managing timeouts? It doesn't to me but this might be the "naive" port to get OpenSM up and running as quickly as possible. -- Hal From roland at topspin.com Mon Nov 15 14:50:39 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 14:50:39 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <1100557988.13150.28.camel@duffman> (Tom Duffy's message of "Mon, 15 Nov 2004 14:33:07 -0800") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> Message-ID: <52wtwmk9q8.fsf@topspin.com> Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, Tom> mthca and ipoib builtin and everything builds and boots fine Tom> (at least on x86_64). Cool, thanks for testing. For what it's worth, it works here on i386 as well. (Not very convenient for development though :) - R. 
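Relating the solicited-response thread above to code: the request/response matching under discussion reduces to a TID lookup on the receive path, with unmatched responses silently dropped. A minimal sketch of the idea follows; the structure and function names are hypothetical, not the actual mad.c definitions:

	#include <linux/types.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct example_sent_mad {
		struct list_head list;
		u64 tid;		/* transaction ID of the request */
		/* ... send context, timeout state ... */
	};

	/*
	 * Find and remove the outstanding request matching a received
	 * response; a NULL return means the response is discarded.
	 */
	static struct example_sent_mad *
	example_match_response(struct list_head *sends, spinlock_t *lock,
			       u64 tid)
	{
		struct example_sent_mad *sent;
		unsigned long flags;

		spin_lock_irqsave(lock, flags);
		list_for_each_entry(sent, sends, list) {
			if (sent->tid == tid) {
				list_del(&sent->list);
				spin_unlock_irqrestore(lock, flags);
				return sent;
			}
		}
		spin_unlock_irqrestore(lock, flags);
		return NULL;
	}

As Roland notes earlier in the thread, a consumer that reuses a TID while a request is still outstanding gets whatever ambiguity it has created; the MAD layer only promises sensible matching for unique TIDs.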
From peter at pantasys.com Mon Nov 15 15:22:45 2004 From: peter at pantasys.com (Peter Buckingham) Date: Mon, 15 Nov 2004 15:22:45 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <52wtwmk9q8.fsf@topspin.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> Message-ID: <41993A45.7040406@pantasys.com> Roland Dreier wrote: > Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, > Tom> mthca and ipoib builtin and everything builds and boots fine > Tom> (at least on x86_64). > > Cool, thanks for testing. For what it's worth, it works here on i386 > as well. (Not very convenient for development though :) So gen2 works. From what I understand OpenSM is not yet supported for gen2. What other things are still missing between gen1 and gen2? (sorry, this is probably a FAQ...) thanks, peter From roland at topspin.com Mon Nov 15 15:34:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 15:34:38 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <41993A45.7040406@pantasys.com> (Peter Buckingham's message of "Mon, 15 Nov 2004 15:22:45 -0800") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> <41993A45.7040406@pantasys.com> Message-ID: <52sm7ak7ox.fsf@topspin.com> Peter> So gen2 works. From what I understand OpenSM is not yet Peter> supported for gen2. What other things are still missing Peter> between gen1 and gen2? (sorry, this is probably a FAQ...) Easier to answer what works now in gen2: only IPoIB. Everything else (userspace verbs, CM, SDP, etc.) needs to be implemented or ported forward. - R. From Tom.Duffy at Sun.COM Mon Nov 15 15:35:27 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Mon, 15 Nov 2004 15:35:27 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <41993A45.7040406@pantasys.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> <41993A45.7040406@pantasys.com> Message-ID: <1100561727.13150.31.camel@duffman> On Mon, 2004-11-15 at 15:22 -0800, Peter Buckingham wrote: > Roland Dreier wrote: > > Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, > > Tom> mthca and ipoib builtin and everything builds and boots fine > > Tom> (at least on x86_64). > > > > Cool, thanks for testing. For what it's worth, it works here on i386 > > as well. (Not very convenient for development though :) > > So gen2 works. From what I understand OpenSM is not yet supported for > gen2. What other things are still missing between gen1 and gen2? (sorry, > this is probably a FAQ...) Well, all the ULP's except for IPoIB. So, for now, no SRP, SDP, *DAPL, NFSoRDMA, etc. Also, no 2.4.x kernel support. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Nov 15 15:48:26 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Nov 2004 15:48:26 -0800 Subject: [openib-general] [PATCH] fix cleanup in MAD code when unloading HCA driver Message-ID: <20041115154826.2162f686.mshefty@ichips.intel.com> After looking at the code, I believe that there's a race condition cleaning up in the MAD code when unloading the HCA driver. The MAD layer can be processing a received MAD when the driver unloads, which can result in accessing the receive queue after all MADs on the receive queue have been freed. This patch should correct that issue, by delaying cleanup of the receive queues until after processing completions. A similar fix is applied recovering from errors when initializing the port. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1237) +++ core/mad.c (working copy) @@ -1677,7 +1677,7 @@ /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) { struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *recv; @@ -1737,7 +1737,7 @@ if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "INIT: %d\n", i, ret); - goto error; + goto out; } attr->qp_state = IB_QPS_RTR; @@ -1745,7 +1745,7 @@ if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTR: %d\n", i, ret); - goto error; + goto out; } attr->qp_state = IB_QPS_RTS; @@ -1754,7 +1754,7 @@ if (ret) { printk(KERN_ERR PFX "Couldn't change QP%d state to " "RTS: %d\n", i, ret); - goto error; + goto out; } } @@ -1762,58 +1762,21 @@ if (ret) { printk(KERN_ERR PFX "Failed to request completion " "notification: %d\n", ret); - goto error; + goto out; } for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { printk(KERN_ERR PFX "Couldn't post receive WRs\n"); - goto error; + goto out; } } - goto out; - -error: - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - attr->qp_state = IB_QPS_RESET; - ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, IB_QP_STATE); - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); - } - out: kfree(attr); return ret; } -/* - * Stop the port - */ -static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) -{ - int i, ret; - struct ib_qp_attr *attr; - - attr = kmalloc(sizeof *attr, GFP_KERNEL); - if (attr) { - attr->qp_state = IB_QPS_RESET; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_modify_qp(port_priv->qp_info[i].qp, attr, - IB_QP_STATE); - if (ret) - printk(KERN_ERR PFX "ib_mad_port_stop: " - "Couldn't change %s port %d QP%d " - "state to RESET\n", - port_priv->device->name, - port_priv->port_num, i); - } - kfree(attr); - } - - for (i = 0; i < IB_MAD_QPS_CORE; i++) - ib_mad_return_posted_recv_mads(&port_priv->qp_info[i]); -} - static void qp_event_handler(struct ib_event *event, void *qp_context) { struct ib_mad_qp_info *qp_info = qp_context; @@ -1832,21 +1795,24 @@ INIT_LIST_HEAD(&mad_queue->list); } -static int create_mad_qp(struct ib_mad_port_private *port_priv, - struct ib_mad_qp_info *qp_info, - enum ib_qp_type qp_type) +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) { - struct ib_qp_init_attr qp_init_attr; - int ret; - qp_info->port_priv = port_priv; init_mad_queue(qp_info, 
&qp_info->send_queue); init_mad_queue(qp_info, &qp_info->recv_queue); INIT_LIST_HEAD(&qp_info->overflow_list); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; @@ -1854,10 +1820,10 @@ qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; - qp_init_attr.port_num = port_priv->port_num; + qp_init_attr.port_num = qp_info->port_priv->port_num; qp_init_attr.qp_context = qp_info; qp_init_attr.event_handler = qp_event_handler; - qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr); + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); if (IS_ERR(qp_info->qp)) { printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", get_spl_qp_index(qp_type)); @@ -1903,11 +1869,13 @@ printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); return -ENOMEM; } - memset(port_priv, 0, sizeof *port_priv); port_priv->device = device; port_priv->port_num = port_num; spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, @@ -1934,16 +1902,13 @@ goto error5; } - ret = create_mad_qp(port_priv, &port_priv->qp_info[0], IB_QPT_SMI); + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); if (ret) goto error6; - ret = create_mad_qp(port_priv, &port_priv->qp_info[1], IB_QPT_GSI); + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); if (ret) goto error7; - spin_lock_init(&port_priv->reg_lock); - INIT_LIST_HEAD(&port_priv->agent_list); - port_priv->wq = create_workqueue("ib_mad"); if (!port_priv->wq) { ret = -ENOMEM; @@ -1974,6 +1939,8 @@ ib_dealloc_pd(port_priv->pd); error4: ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); error3: kfree(port_priv); @@ -2000,7 +1967,7 @@ list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - ib_mad_port_stop(port_priv); + /* Stop processing completions. */ flush_workqueue(port_priv->wq); destroy_workqueue(port_priv->wq); destroy_mad_qp(&port_priv->qp_info[1]); @@ -2008,6 +1975,8 @@ ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); /* XXX: Handle deallocation of MAD registration tables */ kfree(port_priv); From roland at topspin.com Mon Nov 15 20:18:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 20:18:38 -0800 Subject: [openib-general] [PATCH] fix warning in mad.c Message-ID: <52oehyjujl.fsf@topspin.com> flags for spin lock should be unsigned long, not int. - R. 
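For context on the one-liner that follows: spin_lock_irqsave() saves the processor's interrupt state into the variable passed by name, and the kernel treats that state as a full unsigned long, so declaring it int can truncate the value on 64-bit architectures and draws a compiler warning. A minimal sketch of the correct pattern:

	#include <linux/spinlock.h>

	static void example_critical_section(spinlock_t *lock)
	{
		unsigned long flags;	/* must be unsigned long, not int */

		spin_lock_irqsave(lock, flags);
		/* ... touch data protected by lock ... */
		spin_unlock_irqrestore(lock, flags);
	}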
Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 1239) +++ infiniband/core/mad.c (working copy) @@ -1353,7 +1353,7 @@ { struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_list_head *mad_list; - int flags; + unsigned long flags; spin_lock_irqsave(&qp_info->send_queue.lock, flags); list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { From roland at topspin.com Mon Nov 15 20:43:06 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 20:43:06 -0800 Subject: [openib-general] [PATCH] Get rid of /proc/infiniband/ipoib_vlan Message-ID: <52k6smjtet.fsf@topspin.com> This kills off /proc/infiniband/ipoib_vlan in favor of a simpler sysfs interface. To create ib0.8001, you can now just do # echo 0x8001 > /sys/class/net/ib0/create_child and to get rid of the interface, # echo 0x8001 > /sys/class/net/ib0/delete_child (Better names for these files gladly accepted) To see a child interface's parent (in case interfaces have been renamed to something nonobvious): # cat /sys/class/net/ib0.8001/parent ib0 and to check the P_Key of an interface: # cat /sys/class/net/ib0/pkey 0xffff - Roland Index: infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- infiniband/ulp/ipoib/ipoib_vlan.c (revision 1239) +++ infiniband/ulp/ipoib/ipoib_vlan.c (working copy) @@ -32,43 +32,62 @@ #include "ipoib.h" -struct ipoib_vlan_iter { - struct list_head *pintf_cur; - struct list_head *intf_cur; -}; +/* + * We use this mutex to serialize child interface creation. This + * closes the race where userspace might create the same child + * interface twice at exactly the same time. + */ +static DECLARE_MUTEX(vlan_mutex); -static DECLARE_MUTEX(proc_mutex); +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); -int ipoib_vlan_add(struct net_device *pdev, char *intf_name, - unsigned short pkey) + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) { struct ipoib_dev_priv *ppriv, *priv; - int result = -ENOMEM; + char intf_name[IFNAMSIZ]; + int result; if (!capable(CAP_NET_ADMIN)) return -EPERM; + down(&vlan_mutex); + ppriv = netdev_priv(pdev); /* * First ensure this isn't a duplicate. We check the parent device and * then all of the child interfaces to make sure the Pkey doesn't match. 
*/ - if (ppriv->pkey == pkey) - return -ENOTUNIQ; + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } down(&ipoib_device_mutex); list_for_each_entry(priv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { up(&ipoib_device_mutex); - return -ENOTUNIQ; + result = -ENOTUNIQ; + goto err; } } up(&ipoib_device_mutex); + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); priv = ipoib_intf_alloc(intf_name); - if (!priv) - goto alloc_mem_failed; + if (!priv) { + result = -ENOMEM; + goto err; + } set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); @@ -92,19 +111,33 @@ goto register_failed; } + priv->parent = ppriv->dev; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + down(&ipoib_device_mutex); list_add_tail(&priv->list, &ppriv->child_intfs); up(&ipoib_device_mutex); + up(&vlan_mutex); return 0; +sysfs_failed: + unregister_netdev(priv->dev); + register_failed: ipoib_dev_cleanup(priv->dev); device_init_failed: free_netdev(priv->dev); -alloc_mem_failed: +err: + up(&vlan_mutex); return result; } @@ -120,13 +153,8 @@ down(&ipoib_device_mutex); list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { - if (priv->dev->flags & IFF_UP) { - up(&ipoib_device_mutex); - return -EBUSY; - } - - ipoib_dev_cleanup(priv->dev); unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); list_del(&priv->list); @@ -140,219 +168,3 @@ return -ENOENT; } - -/* =============================================================== */ -/*..ipoib_vlan_iter_next -- incr. iter. -- return non-zero at end */ -int ipoib_vlan_iter_next(struct ipoib_vlan_iter *iter) -{ - while (1) { - struct ipoib_dev_priv *priv; - - priv = list_entry(iter->pintf_cur, struct ipoib_dev_priv, list); - if (!iter->intf_cur) - iter->intf_cur = priv->child_intfs.next; - else - iter->intf_cur = iter->intf_cur->next; - - if (iter->intf_cur == &priv->child_intfs) { - iter->pintf_cur = iter->pintf_cur->next; - if (iter->pintf_cur == &ipoib_device_list) - return 1; - - iter->intf_cur = NULL; - return 0; - } else - return 0; - } -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_start -- seq file handling */ -static void *_ipoib_vlan_seq_start(struct seq_file *file, loff_t *pos) -{ - struct ipoib_vlan_iter *iter; - loff_t n = *pos; - - iter = kmalloc(sizeof(*iter), GFP_KERNEL); - if (!iter) - return NULL; - - iter->pintf_cur = ipoib_device_list.next; - iter->intf_cur = NULL; - - while (n--) { - if (ipoib_vlan_iter_next(iter)) { - kfree(iter); - return NULL; - } - } - - return iter; -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_next -- seq file handling */ -static void *_ipoib_vlan_seq_next(struct seq_file *file, void *iter_ptr, - loff_t *pos) -{ - struct ipoib_vlan_iter *iter = iter_ptr; - - (*pos)++; - - if (ipoib_vlan_iter_next(iter)) { - kfree(iter); - return NULL; - } - - return iter; -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_stop -- seq file handling */ -static void _ipoib_vlan_seq_stop(struct seq_file *file, void *iter_ptr) -{ - struct ipoib_vlan_iter *iter = iter_ptr; - - kfree(iter); -} - -/* =============================================================== */ -/*.._ipoib_vlan_seq_show -- seq file handling */ -static int _ipoib_vlan_seq_show(struct seq_file *file, void *iter_ptr) -{ - struct ipoib_vlan_iter *iter = 
iter_ptr; - - if (iter) { - struct ipoib_dev_priv *ppriv; - - ppriv = list_entry(iter->pintf_cur, struct ipoib_dev_priv, list); - - if (!iter->intf_cur) - seq_printf(file, "%s 0x%04x\n", ppriv->dev->name, - ppriv->pkey); - else { - struct ipoib_dev_priv *priv; - - priv = list_entry(iter->intf_cur, struct ipoib_dev_priv, - list); - - seq_printf(file, " %s %s 0x%04x\n", ppriv->dev->name, - priv->dev->name, priv->pkey); - } - } - - return 0; -} - -static struct seq_operations ipoib_vlan_seq_operations = { - .start = _ipoib_vlan_seq_start, - .next = _ipoib_vlan_seq_next, - .stop = _ipoib_vlan_seq_stop, - .show = _ipoib_vlan_seq_show, -}; - -/* =============================================================== */ -/*.._ipoib_vlan_proc_open -- proc file handling */ -static int _ipoib_vlan_proc_open(struct inode *inode, struct file *file) -{ - if (down_interruptible(&proc_mutex)) - return -ERESTARTSYS; - - return seq_open(file, &ipoib_vlan_seq_operations); -} - -/* =============================================================== */ -/*.._ipoib_vlan_proc_write -- proc file handling */ -static ssize_t _ipoib_vlan_proc_write(struct file *file, - const char __user *buffer, - size_t count, loff_t *pos) -{ - int result; - char kernel_buf[256]; - char intf_parent[128], intf_name[128]; - unsigned int pkey; - struct net_device *pdev; - - count = min(count, sizeof(kernel_buf)); - - if (copy_from_user(kernel_buf, buffer, count)) - return -EFAULT; - - kernel_buf[count - 1] = '\0'; - - if (sscanf(kernel_buf, "add %128s %128s %i", intf_parent, intf_name, - &pkey) == 3) { - if (pkey > 0xffff) - return -EINVAL; - - pdev = dev_get_by_name(intf_parent); - if (!pdev) - return -ENOENT; - - result = ipoib_vlan_add(pdev, intf_name, pkey); - - dev_put(pdev); - - if (result < 0) - return result; - } else if (sscanf(kernel_buf, "del %128s %i", intf_parent, - &pkey) == 2) { - if (pkey > 0xffff) - return -EINVAL; - - pdev = dev_get_by_name(intf_parent); - if (!pdev) - return -ENOENT; - - result = ipoib_vlan_delete(pdev, pkey); - - dev_put(pdev); - - if (result < 0) - return result; - } else - return -EINVAL; - - return count; -} - -/* =============================================================== */ -/*.._ipoib_vlan_proc_release -- proc file handling */ -static int _ipoib_vlan_proc_release(struct inode *inode, struct file *file) -{ - up(&proc_mutex); - - return seq_release(inode, file); -} - -static struct file_operations ipoib_vlan_proc_operations = { - .owner = THIS_MODULE, - .open = _ipoib_vlan_proc_open, - .read = seq_read, - .write = _ipoib_vlan_proc_write, - .llseek = seq_lseek, - .release = _ipoib_vlan_proc_release, -}; - -struct proc_dir_entry *vlan_proc_entry; - -int ipoib_vlan_init(void) -{ - vlan_proc_entry = create_proc_entry("ipoib_vlan", - S_IRUGO | S_IWUGO, ipoib_proc_dir); - - if (!vlan_proc_entry) { - printk(KERN_WARNING "Can't create ipoib_vlan in /proc\n"); - return -ENOMEM; - } - - vlan_proc_entry->proc_fops = &ipoib_vlan_proc_operations; - - return 0; -} - -void ipoib_vlan_cleanup(void) -{ - if (vlan_proc_entry) - remove_proc_entry("ipoib_vlan", ipoib_proc_dir); -} Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1239) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -609,7 +609,6 @@ void ipoib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; - int i; /* Delete any child interfaces first */ /* Safe since it's either protected by 
ipoib_device_mutex or empty */ @@ -626,19 +625,11 @@ ipoib_ib_dev_cleanup(dev); if (priv->rx_ring) { - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) - if (priv->rx_ring[i].skb) - dev_kfree_skb_any(priv->rx_ring[i].skb); - kfree(priv->rx_ring); priv->rx_ring = NULL; } if (priv->tx_ring) { - for (i = 0; i < IPOIB_TX_RING_SIZE; ++i) - if (priv->tx_ring[i].skb) - dev_kfree_skb_any(priv->tx_ring[i].skb); - kfree(priv->tx_ring); priv->tx_ring = NULL; } @@ -714,6 +705,60 @@ return netdev_priv(dev); } +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + static int ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { struct ipoib_dev_priv *priv; @@ -771,6 +816,15 @@ if (ipoib_proc_dev_init(priv->dev)) goto proc_failed; + if (ipoib_add_pkey_attr(priv->dev)) + goto proc_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto proc_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto proc_failed; + down(&ipoib_device_mutex); list_add_tail(&priv->list, &ipoib_device_list); up(&ipoib_device_mutex); @@ -860,8 +914,6 @@ if (ret) goto err_wq; - ipoib_vlan_init(); - return 0; err_wq: @@ -875,7 +927,6 @@ static void __exit ipoib_cleanup_module(void) { - ipoib_vlan_cleanup(); ib_unregister_client(&ipoib_client); remove_proc_entry("infiniband", NULL); destroy_workqueue(ipoib_workqueue); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1239) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -124,7 +124,6 @@ union ib_gid local_gid; u16 local_lid; - u32 local_qpn; unsigned int admin_mtu; unsigned int mcast_mtu; @@ -145,6 +144,7 @@ struct net_device_stats stats; + struct net_device *parent; struct list_head child_intfs; struct list_head list; }; @@ -186,6 +186,8 @@ kref_put(&ah->ref, ipoib_free_ah); } +int ipoib_add_pkey_attr(struct net_device *dev); + void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(void *dev_ptr); @@ -240,8 +242,8 @@ void ipoib_event(struct ib_event_handler *handler, struct ib_event *record); -int ipoib_vlan_init(void); -void ipoib_vlan_cleanup(void); +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct 
net_device *pdev, unsigned short pkey); void ipoib_pkey_poll(void *dev); int ipoib_pkey_dev_delay_open(struct net_device *dev); Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1239) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -229,7 +229,6 @@ priv->stats.tx_bytes += tx_req->skb->len; dev_kfree_skb_any(tx_req->skb); - tx_req->skb = NULL; spin_lock_irqsave(&priv->lock, flags); ++priv->tx_tail; @@ -336,7 +335,6 @@ address->ah, qpn, addr, skb->len)) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - tx_req->skb = NULL; dev_kfree_skb_any(skb); } else { unsigned long flags; @@ -485,6 +483,8 @@ while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) yield(); + ipoib_dbg(priv, "All sends and receives done.\n"); + qp_attr.qp_state = IB_QPS_RESET; attr_mask = IB_QP_STATE; if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) @@ -499,12 +499,9 @@ yield(); } - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { - if (priv->rx_ring[i].skb) { - dev_kfree_skb_any(priv->rx_ring[i].skb); - priv->rx_ring[i].skb = NULL; - } - } + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); return 0; } From roland at topspin.com Mon Nov 15 20:46:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 15 Nov 2004 20:46:54 -0800 Subject: [openib-general] warning: ipoibcfg no longer needed In-Reply-To: <52k6smjtet.fsf@topspin.com> (Roland Dreier's message of "Mon, 15 Nov 2004 20:43:06 -0800") References: <52k6smjtet.fsf@topspin.com> Message-ID: <52fz3ajt8h.fsf@topspin.com> I just committed a change to IPoIB that means ipoibcfg is no longer needed (and will no longer work). See the previous message in this thread, "[PATCH] Get rid of /proc/infiniband/ipoib_vlan", for full details. - Roland From halr at voltaire.com Mon Nov 15 21:04:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 00:04:53 -0500 Subject: [openib-general] [PATCH] fix warning in mad.c In-Reply-To: <52oehyjujl.fsf@topspin.com> References: <52oehyjujl.fsf@topspin.com> Message-ID: <1100581493.3369.2393.camel@localhost.localdomain> On Mon, 2004-11-15 at 23:18, Roland Dreier wrote: > flags for spin lock should be unsigned long, not int. Thanks. Applied. -- Hal From halr at voltaire.com Mon Nov 15 21:44:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 00:44:36 -0500 Subject: [openib-general] umad doc Message-ID: <1100583875.3369.2405.camel@localhost.localdomain> Hi Roland, Should the user-mad.txt doc indicate /udev rather than /dev as follows: /udev files r.t. /dev files /udev/infiniband/mthca0/ports/1/mad r.t. /dev/infiniband/mthca0/ports/1/mad -- Hal From itoumsn at nttdata.co.jp Tue Nov 16 04:49:10 2004 From: itoumsn at nttdata.co.jp (Masanori ITOH) Date: Tue, 16 Nov 2004 21:49:10 +0900 (JST) Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA In-Reply-To: <20041112.151421.120503395.itoumsn@nttdata.co.jp> References: <20041112.151421.120503395.itoumsn@nttdata.co.jp> Message-ID: <20041116.214910.01371084.itoumsn@nttdata.co.jp> Hi folks, From: Masanori ITOH Subject: [openib-general] OpenIB gen1 stack u/kDAPL by NTT DATA Date: Fri, 12 Nov 2004 15:14:21 +0900 (JST) > > Hello folks, > > As I mentioned fomerly on this list, I have a working u/kDAPL on top of > the gen1 stack and I've finally finished all internal procedures > to make it public. > # Actually, it took me about one month and a half. Sigh... 
:( > > I would like to put that into the OpenIB contributors area > (Somewhere like 'https://openib.org/svn/trunk/contrib/nttdata/'.), > and could anyone tell me how I can do that? Today, I checked in my u/kDAPL work into: https://openib.org/svn/trunk/contrib/nttdata/ A detailed readme document is included in 'ntt_dapl_1.0.tar.bz2', and I hope that my work also could be a base of the gen2 u/kDAPL. Regards, Masanori --- Masanori ITOH Open Source Software Development Center, NTT DATA CORPORATION e-mail: itoumsn at nttdata.co.jp phone : +81-3-3523-8122 (ext. 172-7199) From halr at voltaire.com Tue Nov 16 05:20:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 08:20:13 -0500 Subject: [openib-general] Re: [PATCH] fix cleanup in MAD code when unloading HCA driver In-Reply-To: <20041115154826.2162f686.mshefty@ichips.intel.com> References: <20041115154826.2162f686.mshefty@ichips.intel.com> Message-ID: <1100611212.3369.2422.camel@localhost.localdomain> On Mon, 2004-11-15 at 18:48, Sean Hefty wrote: > After looking at the code, I believe that there's a race condition > cleaning up in the MAD code when unloading the HCA driver. The > MAD layer can be processing a received MAD when the driver unloads, > which can result in accessing the receive queue after all MADs > on the receive queue have been freed. > > This patch should correct that issue, by delaying cleanup of > the receive queues until after processing completions. A > similar fix is applied recovering from errors when initializing > the port. Thanks. Applied. -- Hal From halr at voltaire.com Tue Nov 16 06:12:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 09:12:28 -0500 Subject: [openib-general] [PATCH] mad: In handle_outgoing_smp, remove unneeded call to smi_handle_dr_recv Message-ID: <1100614347.28332.3.camel@hpc-1> mad: In handle_outgoing_smp, remove unneeded call to smi_handle_dr_recv There is no need to check the DR validity on a MAD which has been processed locally Index: mad.c =================================================================== --- mad.c (revision 1244) +++ mad.c (working copy) @@ -410,17 +410,6 @@ goto error1; } if (ret & IB_MAD_RESULT_REPLY) { - if (!smi_handle_dr_smp_recv( - (struct ib_smp *)&mad_priv->mad, - mad_agent->device->node_type, - mad_agent->port_num, - mad_agent->device->phys_port_cnt)) { - ret = -EINVAL; - kmem_cache_free(ib_mad_cache, - mad_priv); - goto error1; - } - /* * See if response is solicited and * there is a recv handler From roland at topspin.com Tue Nov 16 08:01:19 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 08:01:19 -0800 Subject: [openib-general] Re: umad doc In-Reply-To: <1100583875.3369.2405.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 16 Nov 2004 00:44:36 -0500") References: <1100583875.3369.2405.camel@localhost.localdomain> Message-ID: <52brdxkckw.fsf@topspin.com> Hal> Hi Roland, Should the user-mad.txt doc indicate /udev rather Hal> than /dev as follows: I guess it depends on how udev is set up on your system. On my systems (running Debian sarge), udev manages /dev and there is no /udev tree. I believe this is the way things are expected to be done on a completely modern system. - R. 
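Hal's follow-up question about major/minor numbers is answered just below: they are allocated dynamically via alloc_chrdev_region(), which is why udev, rather than a static /dev entry, is the natural way to create the nodes. For reference, the dynamic allocation pattern looks roughly like the sketch below; the names and minor count are made up, not the actual umad code:

	#include <linux/module.h>
	#include <linux/fs.h>
	#include <linux/cdev.h>

	static dev_t example_base;
	static struct cdev example_cdev;

	static int example_chrdev_register(struct file_operations *fops)
	{
		/* ask the kernel for an unused major plus a range of minors */
		int ret = alloc_chrdev_region(&example_base, 0, 32,
					      "example_umad");
		if (ret)
			return ret;

		cdev_init(&example_cdev, fops);
		example_cdev.owner = THIS_MODULE;
		ret = cdev_add(&example_cdev, example_base, 32);
		if (ret)
			unregister_chrdev_region(example_base, 32);
		return ret;
	}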
From halr at voltaire.com Tue Nov 16 08:13:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 11:13:26 -0500 Subject: [openib-general] Re: umad doc In-Reply-To: <52brdxkckw.fsf@topspin.com> References: <1100583875.3369.2405.camel@localhost.localdomain> <52brdxkckw.fsf@topspin.com> Message-ID: <1100621604.27172.0.camel@hpc-1> On Tue, 2004-11-16 at 11:01, Roland Dreier wrote: > Hal> Hi Roland, Should the user-mad.txt doc indicate /udev rather > Hal> than /dev as follows: > > I guess it depends on how udev is set up on your system. On my > systems (running Debian sarge), udev manages /dev and there is no > /udev tree. I believe this is the way things are expected to be done > on a completely modern system. Guess I didn't completely modernize my machine :-) BTW, are there major and minor device numbers for the IB devices ? -- Hal From roland at topspin.com Tue Nov 16 08:09:37 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 08:09:37 -0800 Subject: [openib-general] Re: umad doc In-Reply-To: <1100621604.27172.0.camel@hpc-1> (Hal Rosenstock's message of "Tue, 16 Nov 2004 11:13:26 -0500") References: <1100583875.3369.2405.camel@localhost.localdomain> <52brdxkckw.fsf@topspin.com> <1100621604.27172.0.camel@hpc-1> Message-ID: <52y8h1ixmm.fsf@topspin.com> Hal> BTW, are there major and minor device numbers for the IB Hal> devices ? They are assigned dynamically by the call to alloc_chrdev_region(). - R. From halr at voltaire.com Tue Nov 16 08:16:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 11:16:52 -0500 Subject: [openib-general] SMI Patch Message-ID: <1100621810.27172.4.camel@hpc-1> Hi, I have a patch which I think fixes the SMI issues for an end node. There is more to be done for switch support but this hopefully is sufficient for SM support. Can you please validate it before I check it in ? I'm a little gun shy about breaking the tree after last week's debacle. If it works in your configurations, I will check it in. Thanks. -- Hal Index: smi.c =================================================================== --- smi.c (revision 1247) +++ smi.c (working copy) @@ -98,6 +98,9 @@ } /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + /* C14-13:5 -- Check for unreasonable hop pointer */ return 0; } Index: mad.c =================================================================== --- mad.c (revision 1245) +++ mad.c (working copy) @@ -1121,7 +1121,7 @@ port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) - goto out; + goto local; if (!smi_handle_dr_smp_send(smp, port_priv->device->node_type, port_priv->port_num)) @@ -1132,6 +1132,7 @@ goto out; } +local: /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { int ret; From halr at voltaire.com Tue Nov 16 08:11:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 11:11:20 -0500 Subject: [openib-general] MAD handling In-Reply-To: <52d5yfptu6.fsf@topspin.com> References: <52d5yfptu6.fsf@topspin.com> Message-ID: <1100621480.3369.2601.camel@localhost.localdomain> On Mon, 2004-11-15 at 00:25, Roland Dreier wrote: > A few questions about MAD handling: I'm re-responding with more concrete/specific answers to the questions below. > - It looks as if the case of response DR SMPs going to the SM is not > handled in smi.c.
smi_check_forward_dr_smp() doesn't handle the > case of hop_ptr == 0, and smi_handle_dr_smp_send() just says > > /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ > > and returns 0, which will lead to the packet being dropped. How > should this be fixed? I added the check for hop_ptr 0 into smi_handle_dr_smp_send. smi_check_forward_dr_smp() is correct, since for hop_ptr 0 it returns 0, which means the SMP should be completed up the stack (which wasn't being done in mad.c). -- Hal From mshefty at ichips.intel.com Tue Nov 16 09:42:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 16 Nov 2004 09:42:16 -0800 Subject: [openib-general] Re: SMI Patch In-Reply-To: <1100621810.27172.4.camel@hpc-1> References: <1100621810.27172.4.camel@hpc-1> Message-ID: <419A3BF8.3080908@ichips.intel.com> Hal Rosenstock wrote: > Hi, > > I have a patch which I think fixes the SMI issues for an end node. There > is more to be done for switch support but this hopefully is sufficient > for SM support. Can you please validate it before I check it in ? I'm a > little gun shy about breaking the tree after last week's debacle. If it > works in your configurations, I will check it in. I applied the patch to my local repository, and it worked fine. One item of note is that my test system is connected into a switched fabric with opensm running on the source forge stack. When I load the openib stack, the port goes to INIT. It doesn't go to ACTIVE until I unplug and re-insert the cable. This is true with or without this patch. (I'm connected to a 16-port Mellanox switch.) The systems running the source forge stack do not see this issue. - Sean From roland at topspin.com Tue Nov 16 09:43:18 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 09:43:18 -0800 Subject: [openib-general] Re: SMI Patch In-Reply-To: <1100621810.27172.4.camel@hpc-1> (Hal Rosenstock's message of "Tue, 16 Nov 2004 11:16:52 -0500") References: <1100621810.27172.4.camel@hpc-1> Message-ID: <52pt2ditah.fsf@topspin.com> Seems to work here as well... - R. From halr at voltaire.com Tue Nov 16 10:29:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 13:29:24 -0500 Subject: [openib-general] Re: SMI Patch In-Reply-To: <419A3BF8.3080908@ichips.intel.com> References: <1100621810.27172.4.camel@hpc-1> <419A3BF8.3080908@ichips.intel.com> Message-ID: <1100629763.27971.6.camel@hpc-1> On Tue, 2004-11-16 at 12:42, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Hi, > > > > I have a patch which I think fixes the SMI issues for an end node. There > > is more to be done for switch support but this hopefully is sufficient > > for SM support. Can you please validate it before I check it in ? I'm a > > little gun shy about breaking the tree after last week's debacle. If it > > works in your configurations, I will check it in. > > I applied the patch to my local repository, and it worked fine. Thanks for checking this out. > One item of note is that my test system is connected into a switched > fabric with opensm running on the source forge stack. When I load the > openib stack, the port goes to INIT. It doesn't go to ACTIVE until I > unplug and re-insert the cable. This is true with or without this > patch. (I'm connected to a 16-port Mellanox switch.) The systems > running the source forge stack do not see this issue. Can you see whether any packets received make it to ib_mad_recv_done_handler when the port stays in INIT ? It seems weird that a cable reinsertion would bring it back to life.
Sounds like some sort of initialization issue that gets fixed on a cable reinsertion. Is this reproducible every time or intermittent ? -- Hal > > - Sean From halr at voltaire.com Tue Nov 16 10:29:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 13:29:38 -0500 Subject: [openib-general] Re: SMI Patch In-Reply-To: <52pt2ditah.fsf@topspin.com> References: <1100621810.27172.4.camel@hpc-1> <52pt2ditah.fsf@topspin.com> Message-ID: <1100629778.27971.9.camel@hpc-1> On Tue, 2004-11-16 at 12:43, Roland Dreier wrote: > Seems to work here as well... Thanks for trying this. I will check this in shortly. -- Hal From halr at voltaire.com Tue Nov 16 10:35:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 13:35:29 -0500 Subject: [openib-general] [PATCH] SMI/MAD: Fix a couple of SMI cases Message-ID: <1100630129.27971.16.camel@hpc-1> smi/mad: In smi_handle_dr_smp_send, handle hop_ptr 0 (C14-13:4). Also, in ib_mad_recv_handler_done, 0 return from smi_check_forward_dr_smp means local rather than discard. Index: smi.c =================================================================== --- smi.c (revision 1247) +++ smi.c (working copy) @@ -98,6 +98,9 @@ } /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + /* C14-13:5 -- Check for unreasonable hop pointer */ return 0; } Index: mad.c =================================================================== --- mad.c (revision 1245) +++ mad.c (working copy) @@ -1121,7 +1121,7 @@ port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(smp)) - goto out; + goto local; if (!smi_handle_dr_smp_send(smp, port_priv->device->node_type, port_priv->port_num)) @@ -1132,6 +1132,7 @@ goto out; } +local: /* Give driver "right of first refusal" on incoming MAD */ if (port_priv->device->process_mad) { int ret; From blist at aon.at Tue Nov 16 10:30:19 2004 From: blist at aon.at (Bernhard Fischer) Date: Tue, 16 Nov 2004 19:30:19 +0100 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP Message-ID: <20041116183019.GB1206@aon.at> Hi, from linux/spinlock.h: "spin_is_locked on UP always says FALSE" please consider applying, -------------- next part -------------- diff -x '*.diff' -rup gen2.oorig/src/linux-kernel/infiniband/core/agent.c gen2/src/linux-kernel/infiniband/core/agent.c --- gen2.oorig/src/linux-kernel/infiniband/core/agent.c 2004-11-12 16:29:26.000000000 +0100 +++ gen2/src/linux-kernel/infiniband/core/agent.c 2004-11-16 19:11:04.595949168 +0100 @@ -42,7 +42,9 @@ __ib_get_agent_port(struct ib_device *de { struct ib_agent_port_private *entry; +#if defined(CONFIG_SMP) BUG_ON(!spin_is_locked(&ib_agent_port_list_lock)); +#endif BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ if (device) { diff -x '*.diff' -rup gen2.oorig/src/linux-kernel/infiniband/core/mad.c gen2/src/linux-kernel/infiniband/core/mad.c --- gen2.oorig/src/linux-kernel/infiniband/core/mad.c 2004-11-16 17:24:36.000000000 +0100 +++ gen2/src/linux-kernel/infiniband/core/mad.c 2004-11-16 19:09:25.577038602 +0100 @@ -100,7 +100,9 @@ __ib_get_mad_port(struct ib_device *devi { struct ib_mad_port_private *entry; +#if defined(CONFIG_SMP) BUG_ON(!spin_is_locked(&ib_mad_port_list_lock)); +#endif list_for_each_entry(entry, &ib_mad_port_list, port_list) { if (entry->device == device && entry->port_num == port_num) return entry; From roland at topspin.com Tue Nov 16 10:38:41 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 10:38:41 -0800 Subject: [openib-general] [patch] 
mad.c, agent.c spinlocking on UP In-Reply-To: <20041116183019.GB1206@aon.at> (Bernhard Fischer's message of "Tue, 16 Nov 2004 19:30:19 +0100") References: <20041116183019.GB1206@aon.at> Message-ID: <52ekitiqq6.fsf@topspin.com> Bernhard> Hi, from linux/spinlock.h: "spin_is_locked on UP always Bernhard> says FALSE" Good catch. Bernhard> please consider applying, Can we try and think of a fix that doesn't involve adding #ifdefs to the source file? Do we really need the BUG_ONs at all? - R. From mshefty at ichips.intel.com Tue Nov 16 10:41:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 16 Nov 2004 10:41:18 -0800 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <52ekitiqq6.fsf@topspin.com> References: <20041116183019.GB1206@aon.at> <52ekitiqq6.fsf@topspin.com> Message-ID: <419A49CE.7000506@ichips.intel.com> Roland Dreier wrote: > Bernhard> Hi, from linux/spinlock.h: "spin_is_locked on UP always > Bernhard> says FALSE" > > Good catch. > > Bernhard> please consider applying, > > Can we try and think of a fix that doesn't involve adding #ifdefs to > the source file? Do we really need the BUG_ONs at all? I'd vote to remove the BUG_ONs, versus adding #ifdef. - Sean From roland at topspin.com Tue Nov 16 10:46:45 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 16 Nov 2004 10:46:45 -0800 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <419A49CE.7000506@ichips.intel.com> (Sean Hefty's message of "Tue, 16 Nov 2004 10:41:18 -0800") References: <20041116183019.GB1206@aon.at> <52ekitiqq6.fsf@topspin.com> <419A49CE.7000506@ichips.intel.com> Message-ID: <52acthiqcq.fsf@topspin.com> Sean> I'd vote to remove the BUG_ONs, versus adding #ifdef. That seems fine to me. Maybe adding a comment in agent.c similar to what mad.c says ("Assumes ib_mad_port_list_lock is being held") is all we really need, something like this: Index: agent.c =================================================================== --- agent.c (revision 1249) +++ agent.c (working copy) @@ -36,6 +36,9 @@ extern kmem_cache_t *ib_mad_cache; +/* + * Caller must hold ib_agent_port_list_lock. + */ static inline struct ib_agent_port_private * __ib_get_agent_port(struct ib_device *device, int port_num, struct ib_mad_agent *mad_agent) From halr at voltaire.com Tue Nov 16 11:03:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 14:03:57 -0500 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <20041116183019.GB1206@aon.at> References: <20041116183019.GB1206@aon.at> Message-ID: <1100631837.27971.22.camel@hpc-1> On Tue, 2004-11-16 at 13:30, Bernhard Fischer wrote: > Hi, > > from linux/spinlock.h: "spin_is_locked on UP always says FALSE" > > please consider applying, > > ______________________________________________________________________ Thanks for pointing this out. Guess we all were running SMP. The consensus seems to be to eliminate these rather than conditionalize them. -- Hal From halr at voltaire.com Tue Nov 16 11:04:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Nov 2004 14:04:07 -0500 Subject: [openib-general] [patch] mad.c, agent.c spinlocking on UP In-Reply-To: <52acthiqcq.fsf@topspin.com> References: <20041116183019.GB1206@aon.at> <52ekitiqq6.fsf@topspin.com> <419A49CE.7000506@ichips.intel.com> <52acthiqcq.fsf@topspin.com> Message-ID: <1100631847.27971.24.camel@hpc-1> On Tue, 2004-11-16 at 13:46, Roland Dreier wrote: > Sean> I'd vote to remove the BUG_ONs, versus adding #ifdef. 
> > That seems fine to me. Maybe adding a comment in agent.c similar to
> what mad.c says ("Assumes ib_mad_port_list_lock is being held") is all
> we really need, something like this:
>
> Index: agent.c
> ===================================================================
> --- agent.c (revision 1249)
> +++ agent.c (working copy)
> @@ -36,6 +36,9 @@
> extern kmem_cache_t *ib_mad_cache;
>
>
> +/*
> + * Caller must hold ib_agent_port_list_lock.
> + */
> static inline struct ib_agent_port_private *
> __ib_get_agent_port(struct ib_device *device, int port_num,
> struct ib_mad_agent *mad_agent)

Thanks. Applied.

-- Hal

From halr at voltaire.com Tue Nov 16 11:32:33 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 16 Nov 2004 14:32:33 -0500
Subject: [openib-general] Setting of MAD TID for user mode clients
Message-ID: <1100633553.27971.28.camel@hpc-1>

Hi,

Should it be the responsibility of user_mad or the client itself to set
the hi_tid ? Right now, it's in user_mad::ib_umad_write. Just wondering...

-- Hal

From roland at topspin.com Tue Nov 16 11:43:56 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 16 Nov 2004 11:43:56 -0800
Subject: [openib-general] Re: Setting of MAD TID for user mode clients
In-Reply-To: <1100633553.27971.28.camel@hpc-1> (Hal Rosenstock's message of "Tue, 16 Nov 2004 14:32:33 -0500")
References: <1100633553.27971.28.camel@hpc-1>
Message-ID: <526545inpf.fsf@topspin.com>

Hal> Hi, Should it be the responsibility of user_mad or the client
Hal> itself to set the hi_tid ? Right now, it's in
Hal> user_mad::ib_umad_write.

I think it has to be in the kernel (ie in user_mad.c) because we can't
trust anything userspace gives us.

 - R.

From mshefty at ichips.intel.com Tue Nov 16 11:49:34 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 16 Nov 2004 11:49:34 -0800
Subject: [openib-general] Re: Setting of MAD TID for user mode clients
In-Reply-To: <526545inpf.fsf@topspin.com>
References: <1100633553.27971.28.camel@hpc-1> <526545inpf.fsf@topspin.com>
Message-ID: <419A59CE.3000705@ichips.intel.com>

Roland Dreier wrote:
> Hal> Hi, Should it be the responsibility of user_mad or the client
> Hal> itself to set the hi_tid ? Right now, it's in
> Hal> user_mad::ib_umad_write.
>
> I think it has to be in the kernel (ie in user_mad.c) because we can't
> trust anything userspace gives us.

agreed

From mshefty at ichips.intel.com Tue Nov 16 17:07:10 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 16 Nov 2004 17:07:10 -0800
Subject: [openib-general] RMPP implementation
Message-ID: <419AA43E.4050405@ichips.intel.com>

I'm starting work on the RMPP implementation in the MAD code. If anyone
has any ideas/preferences on the implementation, please let me know.

For the send side, there are a couple of ways to perform the segmentation:

1. Issue one send at a time. Additional sends are not transferred until
   the first send completes.
2. Issue multiple sends using 2 data segments per request. This
   requires allocating and mapping space (36 bytes) for copying the MAD
   common and RMPP headers.
3. Issue multiple sends using 3 data segments per request. This is the
   same as #2, but only copies the RMPP header.

I'm leaning towards #2 at this point.

RMPP is fairly complex, so I will probably submit a series of patches,
rather than the entire implementation at once.
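[To make option 2 concrete: each RMPP segment would go out as one send work request with a two-entry gather list, the first entry covering the 36-byte bounce copy of the MAD common header (24 bytes) plus RMPP header (12 bytes), the second pointing into the payload at the segment's offset. A rough sketch against the gen2 verbs structures of this period, not code from Sean's series; rmpp_seg_ctx and rmpp_post_segment are invented names:

	struct rmpp_seg_ctx {
		u64 hdr_dma;	/* bounce buffer: copied MAD common + RMPP headers */
		u64 data_dma;	/* full mapped payload */
		u32 lkey;
		u32 seg_size;	/* payload bytes carried per segment */
	};

	static int rmpp_post_segment(struct ib_qp *qp, struct rmpp_seg_ctx *ctx,
				     u32 seg_num, u32 remaining)
	{
		struct ib_sge sge[2];
		struct ib_send_wr wr, *bad_wr;

		/* SGE 0: the 36-byte copy of the MAD common (24) and
		 * RMPP (12) headers, rewritten for each segment. */
		sge[0].addr   = ctx->hdr_dma;
		sge[0].length = 36;
		sge[0].lkey   = ctx->lkey;

		/* SGE 1: this segment's slice of the payload, used in place. */
		sge[1].addr   = ctx->data_dma + (u64) (seg_num - 1) * ctx->seg_size;
		sge[1].length = min_t(u32, remaining, ctx->seg_size);
		sge[1].lkey   = ctx->lkey;

		memset(&wr, 0, sizeof wr);
		wr.wr_id      = (unsigned long) ctx;
		wr.sg_list    = sge;
		wr.num_sge    = 2;
		wr.opcode     = IB_WR_SEND;
		wr.send_flags = IB_SEND_SIGNALED;
		/* UD address handle, remote QPN and Q_Key setup omitted. */

		return ib_post_send(qp, &wr, &bad_wr);
	}

Option 3 would simply split SGE 0 further so that only the 12-byte RMPP header is copied per segment.]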
- Sean

From ftillier at infiniconsys.com Tue Nov 16 17:33:43 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Tue, 16 Nov 2004 17:33:43 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <419AA43E.4050405@ichips.intel.com>
Message-ID: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Tuesday, November 16, 2004 5:07 PM
>
> I'm starting work on the RMPP implementation in the MAD code. If anyone
> has any ideas/preferences on the implementation, please let me know.
>
> For the send side, there are a couple of ways to perform the segmentation:
>
> 1. Issue one send at a time. Additional sends are not transferred until
> the first send completes.
> 2. Issue multiple sends using 2 data segments per request. This
> requires allocating and mapping space (36 bytes) for copying the MAD
> common and RMPP headers.
> 3. Issue multiple sends using 3 data segments per request. This is the
> same as #2, but only copies the RMPP header.
>
> I'm leaning towards #2 at this point.

Isn't #1 the simplest to implement? Turnaround on the send queue should be
pretty quick, so send performance should be fine. I say do whatever is
simplest, and then optimize from there, and to me that means #1 at the
moment. What are the reasons to *not* do #1?

- Fab

From mst at mellanox.co.il Wed Nov 17 01:08:35 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 17 Nov 2004 11:08:35 +0200
Subject: [openib-general] RMPP implementation
In-Reply-To: <419AA43E.4050405@ichips.intel.com>
References: <419AA43E.4050405@ichips.intel.com>
Message-ID: <20041117090835.GA6959@mellanox.co.il>

Hello!
Quoting r. Sean Hefty (mshefty at ichips.intel.com) "[openib-general] RMPP implementation":
> I'm starting work on the RMPP implementation in the MAD code. If anyone
> has any ideas/preferences on the implementation, please let me know.
>
> For the send side, there are a couple of ways to perform the segmentation:
>
> 1. Issue one send at a time. Additional sends are not transferred until
> the first send completes.
> 2. Issue multiple sends using 2 data segments per request. This
> requires allocating and mapping space (36 bytes) for copying the MAD
> common and RMPP headers.
> 3. Issue multiple sends using 3 data segments per request. This is the
> same as #2, but only copies the RMPP header.
>
> I'm leaning towards #2 at this point.
>
> RMPP is fairly complex, so I will probably submit a series of patches,
> rather than the entire implementation at once.
>
> - Sean

RMPP is somewhat similar to TCP.
I wonder if there is some way to re-use the TCP stack code.

MST

From roland at topspin.com Wed Nov 17 08:49:22 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 17 Nov 2004 08:49:22 -0800
Subject: [openib-general] ipoib_debugfs -- new kernel patch required for 2.6.9
Message-ID: <52lld0e7zh.fsf@topspin.com>

I just committed changes that replace the IPoIB /proc files with an
ipoib_debugfs filesystem. Using this filesystem is described in the
new docs/ipoib.txt file.

There is a new kernel patch, linux-2.6.9-backports.diff, that is
required to build against a 2.6.9 kernel. This patch just adds the
new d_alloc_name() function from 2.6.10-rc.

 - R.
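[On the hi_tid question settled earlier in the thread: the kernel owns the upper 32 bits of the 64-bit TID, so user_mad can stamp them with the kernel-assigned agent ID regardless of what userspace wrote. A sketch of the idea, not the actual user_mad.c code; the field names are approximate:

	static void umad_stamp_tid(struct ib_mad_agent *agent,
				   struct ib_mad_hdr *hdr)
	{
		u64 tid = be64_to_cpu(hdr->tid);

		/* Keep userspace's low 32 bits, but overwrite the high 32
		 * bits with the kernel-assigned agent ID so the response
		 * can be routed back without trusting userspace. */
		hdr->tid = cpu_to_be64(((u64) agent->hi_tid << 32) |
				       (tid & 0xffffffffULL));
	}
]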
From mshefty at ichips.intel.com Wed Nov 17 09:07:48 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 17 Nov 2004 09:07:48 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com>
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com>
Message-ID: <419B8564.2090805@ichips.intel.com>

Fab Tillier wrote:
>>1. Issue one send at a time. Additional sends are not transferred until
>> the first send completes.
>
> Isn't #1 the simplest to implement? Turnaround on the send queue should be
> pretty quick, so send performance should be fine. I say do whatever is
> simplest, and then optimize from there, and to me that means #1 at the
> moment. What are the reasons to *not* do #1?

It's simpler to implement, and would definitely be the easiest to do on
redirected QPs. The only disadvantage is that it lowers the throughput
between two clients. Also, this is a relatively small decrease in
complexity with respect to the rest of RMPP.

A couple of other areas that will need to be addressed include: RMPP
timeouts, receive window sizes, and user-mode support.

- Sean

From ftillier at infiniconsys.com Wed Nov 17 09:22:11 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Wed, 17 Nov 2004 09:22:11 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <419B8564.2090805@ichips.intel.com>
Message-ID: <000301c4ccc9$fa3a3cf0$655aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Wednesday, November 17, 2004 9:08 AM
>
> Fab Tillier wrote:
> >>1. Issue one send at a time. Additional sends are not transferred until
> >> the first send completes.
> >
> > Isn't #1 the simplest to implement? Turnaround on the send queue should
> > be
> > pretty quick, so send performance should be fine. I say do whatever is
> > simplest, and then optimize from there, and to me that means #1 at the
> > moment. What are the reasons to *not* do #1?
>
> It's simpler to implement, and would definitely be the easiest to do on
> redirected QPs. The only disadvantage is that it lowers the throughput
> between two clients. Also, this is a relatively small decrease in
> complexity with respect to the rest of RMPP.

I agree that it will lower the throughput, but by how much? I would expect
it to be minimal. It also allows more concurrent transfers to progress.
I'm thinking that the SA is likely the primary user of RMPP sends, and thus
responding to more queries in parallel is probably better than responding
to queries serially but faster for each query. The send completion delay
is likely to be less than the RMPP timeouts, so we might as well keep many
requestors going rather than get a response to any one client quickly.

- Fab

From roland at topspin.com Wed Nov 17 09:31:36 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 17 Nov 2004 09:31:36 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <419B8564.2090805@ichips.intel.com> (Sean Hefty's message of "Wed, 17 Nov 2004 09:07:48 -0800")
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com> <419B8564.2090805@ichips.intel.com>
Message-ID: <52hdnoe613.fsf@topspin.com>

Would it make sense to figure out what the expected consumers of this
RMPP support will be and what they will need before designing the RMPP
implementation?
 - Roland

From mshefty at ichips.intel.com Wed Nov 17 09:34:58 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 17 Nov 2004 09:34:58 -0800
Subject: [openib-general] RMPP implementation
In-Reply-To: <52hdnoe613.fsf@topspin.com>
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com> <419B8564.2090805@ichips.intel.com> <52hdnoe613.fsf@topspin.com>
Message-ID: <419B8BC2.60806@ichips.intel.com>

Roland Dreier wrote:
> Would it make sense to figure out what the expected consumers of this
> RMPP support will be and what they will need before designing the RMPP
> implementation?

Absolutely. Right now, I'm assuming opensm and SA query as the primary
users.

- Sean

From halr at voltaire.com Wed Nov 17 09:42:04 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 17 Nov 2004 12:42:04 -0500
Subject: [openib-general] RMPP implementation
In-Reply-To: <419B8BC2.60806@ichips.intel.com>
References: <000201c4cc45$7a880200$655aa8c0@infiniconsys.com> <419B8564.2090805@ichips.intel.com> <52hdnoe613.fsf@topspin.com> <419B8BC2.60806@ichips.intel.com>
Message-ID: <1100713324.12272.2.camel@localhost.localdomain>

On Wed, 2004-11-17 at 12:34, Sean Hefty wrote:
> Roland Dreier wrote:
>
> > Would it make sense to figure out what the expected consumers of this
> > RMPP support will be and what they will need before designing the RMPP
> > implementation?
>
> Absolutely. Right now, I'm assuming opensm and SA query as the primary
> users.

By OpenSM, I presume you are referring to the SA. It might also have
other applications: e.g. database synchronization between OpenSMs (but
this would be down the road). I think we also need to understand the
dynamics of the users as well.

-- Hal

From roland at topspin.com Wed Nov 17 14:57:41 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 17 Nov 2004 14:57:41 -0800
Subject: [openib-general] Updated backports patch needed to build
Message-ID: <52d5ycozh6.fsf@topspin.com>

I've checked in a few changes that use kernel features added after the
2.6.9 release. This means to build the latest tree, you have two choices:

 - Apply the latest linux-2.6.9-backports.diff to your 2.6.9 tree.
   This backports all the features required.

 - Use an extremely up-to-date kernel tree. I haven't tried 2.6.10-rc2
   but I expect it would work; certainly an up-to-date BK tree will
   definitely have everything needed. In this case you should apply
   all the patches _except_ linux-2.6.9-backports.diff.

 - Roland

From shaharf at voltaire.com Thu Nov 18 08:14:47 2004
From: shaharf at voltaire.com (shaharf)
Date: Thu, 18 Nov 2004 18:14:47 +0200
Subject: [openib-general] openib gen2 architecture
Message-ID:

Hi all,

I know I am new to this project and I must be naïve, but I want to
understand a few things concerning the openib architecture. In the course
of learning the openib gen2 stack and preparing to port the opensm to it
(which is my current task), I have encountered a few areas that seem
problematic to me, and I would like to understand the reasoning behind
them, if not to offer alternatives. I am sorry that I raise these issues
so late, but I was not involved in this project earlier. I hope it is
better late than never.

It seems to me that the major design approach is to do everything in the
kernel but let user-mode code access the lower levels so that
performance-sensitive applications can bypass all the kernel layers. Am I
right? It seems also that within the kernel, the ib interface/verbs (ib_*)
is very close to the mthca verbs, which are very close to vapi.
I know that this is the way most of the industry was working, but I
wonder - is this the correct model? Will this not pollute the kernel with
a lot of IB specific stuff?

Personally, I think that the IB verbs (vapi) are so complicated that
another level of abstraction is required. PDs, MRs, QPs, the QP state
machine, PKEYs, MLIDs and other "curses" - why should a module such as
IPoIB know about them? If the answer is performance then I have to
disagree. In the same fashion you could say that in order to achieve
efficient disk IO, applications should know the disk's geometry and be
able to do direct IO to the disk firmware, or that applications should
talk SCSI verbs to optimize their data transfers.

It seems to me that the current interfaces evolved to what they are today
mainly because of the way IB itself evolved - with a lot of uncertainty
and a lot of design holes (not to say "craters"). This forced most of the
industry to stick with very straightforward interfaces that were based on
Mellanox VAPI. I wonder if this is not the right time to come up with a
much better abstraction - for user mode and for kernel mode. For example,
it seems that the abstraction layer should abstract the IB networking
objects and not the IB hca interface. In other words - why not build the
abstraction around IB networking types - UD, RC, RD, MADs? Why do we have
to expose the memory management model of the driver/HCA to upper layers?
Do we really want to expose IB related structures such as CQs, QPs, and
WQEs? Why? Not only is this bad for abstraction, since changes in the
drivers will require upper-layer modifications, it is also very
problematic for security and stability reasons.

I think that using the correct abstraction is critical for real
acceptance in the Linux/open source world. A good abstraction will also
enable us to provide good and secure kernel-mode and user-mode interfaces
and access.

Once we have such interfaces, I think we should reconsider the
user/kernel division. As a general rule, I think it is commonly agreed
that the kernel should include only things that must be in the kernel,
meaning hardware-aware software and very performance-sensitive software.
Other software modules may be moved into the kernel once they are mature
and robust. For example, RPC, NFSD and SMBFS (SAMBA) were developed in
user mode, served many years in user mode, and then, after they had
matured, they started to "sink" into the kernel. I think that IB, and
especially the IB management modules, are far from being mature. Even the
IB standard itself is not really stable. Specifically, there is a
requirement (in the SOW) to make IB management distributed due to
scalability and other (redundancy, etc.) requirements. I do not know
whether this requirement will actually materialize, but if it does, the
SM and maybe also the SMI/GSI agents and the CM will have to change
significantly. If this is likely to happen, I would suggest keeping as
much as possible in user mode - it is much easier to develop and to
update. We should have kernel-based agents and mechanisms to assist
performance, but I think that most of the work should be done in user
mode, where it can do less harm. Specifically, things such as a MAD
transaction manager (retries, timeouts, matching), RMPP and others should
be developed in user mode and packaged as libraries, again, at least
until they stabilize and mature.
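[To make the proposal above concrete, here is a purely hypothetical sketch of an interface built around IB networking types rather than HCA objects; nothing like this exists in the tree, and every name below is invented:

	/* Hypothetical "networking-type" abstraction: the consumer deals
	 * in connections and messages, while PDs, MRs, CQs, QPs and the
	 * QP state machine stay hidden inside the midlayer. */
	struct ibnet_conn;			/* opaque connection handle */

	struct ibnet_conn *ibnet_rc_connect(struct ib_device *dev, u8 port,
					    union ib_gid *dgid);
	int  ibnet_send(struct ibnet_conn *conn, const void *buf, size_t len);
	int  ibnet_recv(struct ibnet_conn *conn, void *buf, size_t len);
	void ibnet_disconnect(struct ibnet_conn *conn);
]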
Why should we develop complicated functionality such as RMPP in the
kernel when only a few kernel-based queries (if any at all) will use it?

If I am not mistaken, one of the IB design goals was to enable efficient
*user mode* networking (copy-less, low latency). This is also the major
advantage IB has over several alternatives - most notably 10G Ethernet.
If we do not emphasize such advantages, we will reduce the survival
chances of IB once 10GE is widely used. If potential users get the
impression that, compared to 10GE, IB is cheaper, faster and more
efficient but requires tons of special kernel-based modules and very
complicated interfaces, and is therefore much less stable and much more
exposed to bugs, they will use 10GE. I have no doubt. Yes, it is true
that this project is meant to supply an HPC code base, but eventually, IB
will not survive as an HPC interconnect only. Furthermore, all HPC
applications are user-mode based. Good user-mode interfaces are no less
critical for HPC than for any other high-end networking applications.

I really would like to know whether I am shooting in the dark or whether
the issues I mentioned have been discussed and there are good reasons to
do things the way they are. Or maybe I don't get the picture and the
state of things is completely different from what I am painting. Either
way, I would like to know what you think.

Thanks,

Shahar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From roland at topspin.com Thu Nov 18 08:28:43 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 08:28:43 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To: (shaharf@voltaire.com's message of "Thu, 18 Nov 2004 18:14:47 +0200")
References:
Message-ID: <528y8zm890.fsf@topspin.com>

    shaharf> It seems to me that the major design approach is to do
    shaharf> everything in the kernel but let user-mode code access
    shaharf> the lower levels so that performance-sensitive
    shaharf> applications can bypass all the kernel layers. Am I right?

No. The reason everything is in the kernel now is that we simply have
not started implementing any userspace verbs support.

    shaharf> I wonder if this is not the right time to come up with
    shaharf> a much better abstraction - for user mode and for kernel
    shaharf> mode. For example, it seems that the abstraction layer
    shaharf> should abstract the IB networking objects and not the IB
    shaharf> hca interface. In other words - why not build the
    shaharf> abstraction around IB networking types - UD, RC, RD,
    shaharf> MADs? Why do we have to expose the memory management
    shaharf> model of the driver/HCA to upper layers? Do we really
    shaharf> want to expose IB related structures such as CQs, QPs,
    shaharf> and WQEs? Why? Not only is this bad for abstraction,
    shaharf> since changes in the drivers will require upper-layer
    shaharf> modifications, it is also very problematic for security
    shaharf> and stability reasons.

Keep in mind that CQs, QPs and other IB transport objects are
themselves abstractions. I'm not opposed to better abstractions in
principle, but I think that the current level is a good one. Any IB
hardware is likely to be optimized for implementing these
abstractions, and I have a hard time believing we are smart enough to
build layers on top of them that are both generic enough and
efficient enough for all applications.

    shaharf> I think that using the correct abstraction is critical
    shaharf> for real acceptance in the Linux/open source world.

I agree.
However, I think the Linux kernel community will actually be opposed to
extra abstraction layers that hide what the hardware is really doing.
For example, I believe any really high-performance IB application is
going to understand the work queueing model and want to deal with QPs
and CQs.

    shaharf> Why should we develop complicated functionality such as
    shaharf> RMPP in the kernel when only a few kernel-based queries
    shaharf> (if any at all) will use it?

Funny you should raise this point. I said the same thing some time ago
and Yaron Haviv violently disagreed ;)

 - Roland

From halr at voltaire.com Thu Nov 18 08:42:16 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 18 Nov 2004 11:42:16 -0500
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <52r7n37xz9.fsf@topspin.com>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com>
Message-ID: <1100796136.3277.9.camel@localhost.localdomain>

On Tue, 2004-11-09 at 12:07, Roland Dreier wrote:
> multiport bonding/failover
> (although my feeling is that it would be better to extend the existing
> bonding driver rather than trying to put this in the IPoIB driver), ....

I'm not clear what the tradeoffs / pros / cons of the two approaches
(use the bonding driver (above the IPoIB driver) or implement it inside
the IPoIB driver) would be.

-- Hal

From halr at voltaire.com Thu Nov 18 09:02:03 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 18 Nov 2004 12:02:03 -0500
Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names
Message-ID: <1100797323.3277.19.camel@localhost.localdomain>

mad: Add port number to MAD thread names

Index: mad.c
===================================================================
--- mad.c (revision 1259)
+++ mad.c (working copy)
@@ -1843,6 +1843,7 @@
 	int ret, cq_size;
 	struct ib_mad_port_private *port_priv;
 	unsigned long flags;
+	char name[8];

 	/* First, check if port already open at MAD layer */
 	port_priv = ib_get_mad_port(device, port_num);
@@ -1898,7 +1899,8 @@
 	if (ret)
 		goto error7;

-	port_priv->wq = create_workqueue("ib_mad");
+	sprintf(name, "ib_mad%d", port_num);
+	port_priv->wq = create_workqueue(name);
 	if (!port_priv->wq) {
 		ret = -ENOMEM;
 		goto error8;

From roland at topspin.com Thu Nov 18 09:23:47 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 09:23:47 -0800
Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names
In-Reply-To: <1100797323.3277.19.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 12:02:03 -0500")
References: <1100797323.3277.19.camel@localhost.localdomain>
Message-ID: <52zn1fkr4s.fsf@topspin.com>

It's extremely unlikely, but:

    + char name[8];

    + sprintf(name, "ib_mad%d", port_num);

if port_num >= 10, this will overflow the buffer. Since a device
could conceivably have up to 255 ports (although an HCA with hundreds
of ports is rather far-fetched, and we only create one port for a
switch), I would suggest doing

    char name[sizeof "ib_mad123"];

and

    snprintf(name, sizeof name, "ib_mad%d", port_num);

for correctness and (mostly) ease of auditing.
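[For reference, sizeof applied to a string literal counts the terminating NUL, so char name[sizeof "ib_mad123"] is 10 bytes and already holds the worst case "ib_mad255". A standalone userspace check; the test harness itself is illustrative:

	#include <assert.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char name[sizeof "ib_mad123"];	/* 10 bytes: 9 chars + NUL */

		/* 255 is the largest possible port number, so "ib_mad255"
		 * is the longest name; snprintf never writes past the
		 * buffer it is given. */
		int n = snprintf(name, sizeof name, "ib_mad%d", 255);

		assert(n == (int) strlen("ib_mad255"));
		printf("%s (%zu bytes used of %zu)\n", name,
		       strlen(name) + 1, sizeof name);
		return 0;
	}

It also means the + 1 suggested later in this thread is harmless extra padding rather than a required fix.]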
 - Roland

From Nitin.Hande at Sun.COM Thu Nov 18 09:34:32 2004
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Thu, 18 Nov 2004 09:34:32 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <1100796136.3277.9.camel@localhost.localdomain>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain>
Message-ID: <419CDD28.70906@Sun.COM>

Hal/Roland,
Hal Rosenstock wrote:
> On Tue, 2004-11-09 at 12:07, Roland Dreier wrote:
>
>>multiport bonding/failover
>>(although my feeling is that it would be better to extend the existing
>>bonding driver rather than trying to put this in the IPoIB driver), ....
>
> I'm not clear what the tradeoffs / pros / cons of the two approaches
> (use the bonding driver (above the IPoIB driver) or implement it inside
> the IPoIB driver) would be.

I just started taking a look at the existing bonding driver and
evaluating what work needs to be done to support the ipoib driver below
it. It seems to me that a lot of the pieces for this approach are readily
available (ifenslave and other logic), and besides, I guess that will
keep one standard approach to doing bonding in Linux. I also assume that
while the ipoib gets enslaved, it will get enough opportunity to take
the right set of steps for its present connections and traffic etc. I
might be wrong here though. Would like to hear from other members about
this one.

Thanks
Nitin

> -- Hal

From iod00d at hp.com Thu Nov 18 09:41:51 2004
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 18 Nov 2004 09:41:51 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To:
References:
Message-ID: <20041118174151.GB14868@esmail.cup.hp.com>

On Thu, Nov 18, 2004 at 06:14:47PM +0200, shaharf wrote:
> Personally, I think that the IB verbs (vapi) are so complicated that
> another level of abstraction is required. PDs, MRs, QPs, the QP state
> machine, PKEYs, MLIDs and other "curses" - why should a module such as
> IPoIB know about them?
> If the answer is performance then I have to disagree. In the same fashion
> you could say that in order to achieve efficient disk IO, applications
> should know the disk's geometry and be able to do direct IO to the disk
> firmware, or that applications should talk SCSI verbs to optimize their
> data transfers.

Some applications in fact still do this. (e.g. sgp_dd in sg3-utils
package) But other applications trade off some of the performance for
manageability (abstract storage) and portability to other storage
technologies.

IPoIB should know whatever it needs about IB to get good performance.
It doesn't need to be portable, and layers above (ifconfig) and below
(SA) should be providing manageability. (At least assuming I understand
this correctly)

> I wonder if this is not the right time to come up with a much better
> abstraction - for user mode and for kernel mode. For example, it
> seems that the abstraction layer should abstract the IB networking
> objects and not the IB hca interface. In other words - why not build
> the abstraction around IB networking types - UD, RC, RD, MADs?

If you think it will perform as well as others expect IPoIB should,
then go for it.
People argued replacing the TCP/IP stack (highly tuned) with something
else is a mistake since one also loses all the features (packet
filtering, notably). Beware you aren't the first to present a similar
argument.

I don't really care as long as it works.

thanks,
grant

From Nitin.Hande at Sun.COM Thu Nov 18 09:44:11 2004
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Thu, 18 Nov 2004 09:44:11 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <419CDD28.70906@Sun.COM>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM>
Message-ID: <419CDF6B.3080102@Sun.COM>

Nitin Hande wrote:
> Hal/Roland,
> Hal Rosenstock wrote:
>
>>On Tue, 2004-11-09 at 12:07, Roland Dreier wrote:
>>
>>>multiport bonding/failover
>>>(although my feeling is that it would be better to extend the existing
>>>bonding driver rather than trying to put this in the IPoIB driver), ....
>>
>>I'm not clear what the tradeoffs / pros / cons of the two approaches
>>(use the bonding driver (above the IPoIB driver) or implement it inside
>>the IPoIB driver) would be.
>
> I just started taking a look at the existing bonding driver and
> evaluating what work needs to be done to support the ipoib driver below
> it. It seems to me that a lot of the pieces for this approach are
> readily available (ifenslave and other logic), and besides, I guess
> that will keep one standard approach to doing bonding in Linux. I also
> assume that while the ipoib gets enslaved, it will get enough
> opportunity to take the right set of steps for its present connections
> and traffic etc.

Oh well, looking at the code, the first thing ifenslave does is bring
the slave interface down, thereby forcing ipoib to flush its traffic.

Thanks
Nitin

> I might be wrong here though. Would like to hear from other members
> about this one.
>
> Thanks
> Nitin

From peter at pantasys.com Thu Nov 18 10:18:49 2004
From: peter at pantasys.com (Peter Buckingham)
Date: Thu, 18 Nov 2004 10:18:49 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <419CDD28.70906@Sun.COM>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM>
Message-ID: <419CE789.7080602@pantasys.com>

Nitin Hande wrote:
> I just started taking a look at the existing bonding driver and
> evaluating what work needs to be done to support the ipoib driver below
> it. It seems to me that a lot of the pieces for this approach are
> readily available (ifenslave and other logic), and besides, I guess
> that will keep one standard approach to doing bonding in Linux. I also
> assume that while the ipoib gets enslaved, it will get enough
> opportunity to take the right set of steps for its present connections
> and traffic etc. I might be wrong here though. Would like to hear from
> other members about this one.
i took a quick look at this a little while ago (admittedly with gen1
IPoIB) and when trying to enslave the ib0, ib1 interfaces the bonding
driver complains about an unsupported ioctl. i didn't have any time to
track it down much further than that. if there's some interest i could
take a bit more of a look at it.

peter

From krause at cup.hp.com Thu Nov 18 10:25:43 2004
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 18 Nov 2004 10:25:43 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To: <20041118174151.GB14868@esmail.cup.hp.com>
References: <20041118174151.GB14868@esmail.cup.hp.com>
Message-ID: <6.1.2.0.2.20041118102226.01de9890@esmail.cup.hp.com>

At 09:41 AM 11/18/2004, Grant Grundler wrote:
>On Thu, Nov 18, 2004 at 06:14:47PM +0200, shaharf wrote:
> > Personally, I think that the IB verbs (vapi) are so complicated that
> > another level of abstraction is required. PDs, MRs, QPs, the QP state
> > machine, PKEYs, MLIDs and other "curses" - why should a module such as
> > IPoIB know about them?
> > If the answer is performance then I have to disagree. In the same fashion
> > you could say that in order to achieve efficient disk IO, applications
> > should know the disk's geometry and be able to do direct IO to the disk
> > firmware, or that applications should talk SCSI verbs to optimize their
> > data transfers.

In general, there is very little that IP over IB must know to operate.
It really comes down to the design implementation and the choices people
want to make. Given this is an open source project, one might suggest
that if there is a better way to structure the design, then implement
and propose it as a replacement.

> > I wonder if this is not the right time to come up with a much better
> > abstraction - for user mode and for kernel mode. For example, it
> > seems that the abstraction layer should abstract the IB networking
> > objects and not the IB hca interface. In other words - why not build
> > the abstraction around IB networking types - UD, RC, RD, MADs?

Some designs are modular in nature and keep consumers such as IP over IB
from having to know much more than the QP / work queue / completion queue
to operate. It again comes down to design as some focus on maximum code
re-use. For example, there is nothing that precludes an implementation
from using a kernel IT API / DAPL interface for most subsystems in order
to free itself from all of these details.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mshefty at ichips.intel.com Thu Nov 18 10:29:58 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 18 Nov 2004 10:29:58 -0800
Subject: [openib-general] openib gen2 architecture
In-Reply-To:
References:
Message-ID: <419CEA26.3030702@ichips.intel.com>

shaharf wrote:
> It seems to me that the major design approach is to do everything in
> the kernel but let user-mode code access the lower levels so that
> performance-sensitive applications can bypass all the kernel layers.
> Am I right?

The focus is only on kernel components at the moment, plus whatever
user-mode support is needed to configure the fabric.

> It seems also that within the kernel, the ib interface/verbs (ib_*)
> is very close to the mthca verbs, which are very close to vapi.

The agreement among the IB vendors was to start with VAPI as the base
for the development of a new API. But VAPI is little more than verbs as
defined by the IB spec. InfiniBand hardware exposes PDs, CQs, QPs, etc.,
so I think it's natural for these constructs to appear in the software.
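[The de-multiplexing Sean describes next is little more than looking up the right low-level driver for each registered device, in the same style as the __ib_get_mad_port() helper quoted earlier in this thread. A generic sketch, not the actual core/device.c; the list, lock and field names are illustrative:

	static LIST_HEAD(device_list);
	static spinlock_t device_lock = SPIN_LOCK_UNLOCKED;

	static struct ib_device *find_device(const char *name)
	{
		struct ib_device *dev, *found = NULL;

		/* Walk the registered devices; each entry was added by a
		 * low-level driver (e.g. ib_mthca) at registration time. */
		spin_lock(&device_lock);
		list_for_each_entry(dev, &device_list, core_list)
			if (!strcmp(dev->name, name)) {
				found = dev;
				break;
			}
		spin_unlock(&device_lock);
		return found;
	}
]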
Abstractions away from IB specific constructs do exist in the form of
SDP, ipoib, and DAPL.

> It seems to me that the current interfaces evolved to what they are
> today mainly because of the way IB itself evolved - with a lot of
> uncertainty and a lot of design holes (not to say "craters"). This
> forced most of the industry to stick with very straightforward
> interfaces that were based on Mellanox VAPI.

The verbs interface evolved because IB hardware is expected to expose
this sort of functionality. If you examine the layering of the
software, ib_mthca and ib_core work together, with ib_core simply
de-multiplexing among multiple devices.

> I wonder if this is not the right time to come up with a much better
> abstraction - for user mode and for kernel mode.

I believe that we have the correct software layering. At the lowest
level you need software that talks directly with the hardware, with
support for different hardware devices. Abstractions, such as SDP and
DAPL, should be above that. It seems that you're just wanting to move
the abstraction down lower in the stack.

> Do we really want to expose IB related
> structures such as CQs, QPs, and WQEs? Why?

*blinks*

> etc.) requirements. I do not know whether this requirement will actually
> materialize, but if it does, the SM and maybe also the SMI/GSI agents and
> the CM will have to change significantly. If this is likely to

Coding to requirements that may or may not happen, or to potential
future changes, is likely to produce nothing usable.

> Why should we develop complicated functionality such as
> RMPP in the kernel when only a few kernel-based queries (if any at
> all) will use it?

I'm not opposed to moving functionality from the kernel to user-space,
if it makes sense to do so. Note that TCP is in the kernel, and RMPP is
somewhat similar.

> very complicated interfaces, and is therefore much less
> stable and much more exposed to bugs, they will use 10GE.

Regardless of exposed interfaces, there will still be a need to
implement IB management, meaning that the complexity will still be
there.

> doubt. Yes, it is true that this project is meant to supply an HPC code
> base, but eventually, IB will not survive as an HPC interconnect only.

This is a debatable point. I think that IB can survive as only an HPC
interconnect, and that it may have to. Ethernet may never be as good as
IB, but that doesn't mean that it won't someday be good enough,
especially if it comes for "free" on the motherboard.

From Nitin.Hande at Sun.COM Thu Nov 18 10:41:55 2004
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Thu, 18 Nov 2004 10:41:55 -0800
Subject: [openib-general] Re: More on IPoIB Multicast
In-Reply-To: <419CE789.7080602@pantasys.com>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM> <419CE789.7080602@pantasys.com>
Message-ID: <419CECF3.1030606@Sun.COM>

Peter Buckingham wrote:
> Nitin Hande wrote:
>
>>I just started taking a look at the existing bonding driver and
>>evaluating what work needs to be done to support the ipoib driver below
>>it. It seems to me that a lot of the pieces for this approach are
>>readily available (ifenslave and other logic), and besides, I guess
>>that will keep one standard approach to doing bonding in Linux. I also
>>assume that while the ipoib gets enslaved, it will get enough
>>opportunity to take the right set of steps for its present connections
>>and traffic etc. I might be wrong here though. Would like to hear from
>>other members about this one.
> > > i took a quick look at this a little while ago (admittedly with gen1 > IPoIB) and when trying to enslave the ib0, ib1 interfaces the bonding > driver complains about an unsupported ioctl. i didn't have any time to > track it down much further than that. if there's some interest i could > take a bit more of a look at it. Yes, some of that needs to be implemented for ipoib. I have some bits being readied. But before that would like to hear from people about various approaches. Thanks Nitin > > peter > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Nov 18 10:43:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:43:32 -0500 Subject: [openib-general] Re: More on IPoIB Multicast In-Reply-To: <419CECF3.1030606@Sun.COM> References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain> <419CDD28.70906@Sun.COM> <419CE789.7080602@pantasys.com> <419CECF3.1030606@Sun.COM> Message-ID: <1100803412.3280.4.camel@localhost.localdomain> On Thu, 2004-11-18 at 13:41, Nitin Hande wrote: > But before that would like to hear from people about > various approaches. Some vendors have implemented this by combining multiple HCA ports and failing over from one to the other. Bonding may provide striping (using both ports concurrently). I will need to read up on bonding to understand what it provides and compare it to what can be done under the IPoIB driver. -- Hal From halr at voltaire.com Thu Nov 18 10:45:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:45:50 -0500 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52zn1fkr4s.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> Message-ID: <1100803550.3280.7.camel@localhost.localdomain> On Thu, 2004-11-18 at 12:23, Roland Dreier wrote: > It's extremely unlikely, but: > > + char name[8]; > > + sprintf(name, "ib_mad%d", port_num); > > if port_num >= 10, this will overflow the buffer. Since a device > could conceivably have up to 255 ports (although an HCA with hundreds > of ports is rather far-fetched, and we only create one port for a > switch), I would suggest doing > > char name[sizeof "ib_mad123"]; > > and > > snprintf(name, sizeof name, "ib_mad%d", port_num); > > for correctness and (mostly) ease of auditing. Thanks. Applied. 
-- Hal Index: mad.c =================================================================== --- mad.c (revision 1261) +++ mad.c (working copy) @@ -1843,7 +1843,7 @@ int ret, cq_size; struct ib_mad_port_private *port_priv; unsigned long flags; - char name[8]; + char name[sizeof "ib_mad123"]; /* First, check if port already open at MAD layer */ port_priv = ib_get_mad_port(device, port_num); @@ -1899,7 +1899,7 @@ if (ret) goto error7; - sprintf(name, "ib_mad%d", port_num); + snprintf(name, sizeof name, "ib_mad%d", port_num); port_priv->wq = create_workqueue(name); if (!port_priv->wq) { ret = -ENOMEM; From johannes at erdfelt.com Thu Nov 18 10:53:05 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 18 Nov 2004 10:53:05 -0800 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52zn1fkr4s.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> Message-ID: <20041118185305.GQ27658@sventech.com> On Thu, Nov 18, 2004, Roland Dreier wrote: > It's extremely unlikely, but: > > + char name[8]; > > + sprintf(name, "ib_mad%d", port_num); > > if port_num >= 10, this will overflow the buffer. Since a device > could conceivably have up to 255 ports (although an HCA with hundreds > of ports is rather far-fetched, and we only create one port for a > switch), I would suggest doing > > char name[sizeof "ib_mad123"]; You mean char name[sizeof "ib_mad123" + 1]; right? :) Otherwise we'll limit the name to < 100 ports (yes, yes, nitpicking) > and > > snprintf(name, sizeof name, "ib_mad%d", port_num); > > for correctness and (mostly) ease of auditing. I agree completely. JE From halr at voltaire.com Thu Nov 18 10:49:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:49:04 -0500 Subject: [openib-general] mthca crash on startup Message-ID: <1100803744.3280.11.camel@localhost.localdomain> When starting ib_mthca, I got the following log messages. I am running with the latest bits. It may also have been related to the startup of a switch or SM at the same instant in time. -- Hal Nov 18 13:32:06 localhost kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) Nov 18 13:32:06 localhost kernel: ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:02:00.0) Nov 18 13:32:07 localhost /sbin/hotplug: no runnable /etc/hotplug/infiniband.agent is installed Nov 18 13:32:07 localhost kernel: modprobe: page allocation failure. 
order:6, mode:0x20 Nov 18 13:32:07 localhost kernel: [] __alloc_pages+0x1c2/0x370 Nov 18 13:32:07 localhost kernel: [] __get_free_pages+0x1f/0x40 Nov 18 13:32:07 localhost kernel: [] dma_alloc_coherent+0xce/0x100 Nov 18 13:32:07 localhost kernel: [] mthca_cmd_box+0x85/0xe0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_alloc_sqp+0x6c/0x420 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_create_qp+0x16e/0x180 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] ib_create_qp+0x22/0x80 [ib_core] Nov 18 13:32:07 localhost kernel: [] create_mad_qp+0x86/0xd0 [ib_mad] Nov 18 13:32:07 localhost kernel: [] qp_event_handler+0x0/0x30 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_get_dma_mr+0x1e/0x50 [ib_core] Nov 18 13:32:07 localhost kernel: [] ib_mad_port_open+0x233/0x5c0 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_mad_init_device+0x3e/0x100 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_cache_setup_one+0x12d/0x1d0 [ib_core] Nov 18 13:32:07 localhost /sbin/hotplug: no runnable /etc/hotplug/infiniband.agent is installed Nov 18 13:32:07 localhost kernel: [] ib_mad_init_device+0x0/0x100 [ib_mad] Nov 18 13:32:07 localhost kernel: [] ib_register_device+0x17d/0x1a0 [ib_core] Nov 18 13:32:07 localhost kernel: [] mthca_req_notify_cq+0x0/0x30 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_poll_cq+0x0/0xbb0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_destroy_cq+0x0/0x30 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_register_device+0x15b/0x1a0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] mthca_init_one+0x523/0x6e0 [ib_mthca] Nov 18 13:32:07 localhost kernel: [] pci_device_probe_static+0x52/0x70 Nov 18 13:32:08 localhost kernel: [] __pci_device_probe+0x3c/0x50 Nov 18 13:32:08 localhost kernel: [] pci_device_probe+0x2c/0x50 Nov 18 13:32:08 localhost kernel: [] bus_match+0x3f/0x70 Nov 18 13:32:08 localhost kernel: [] driver_attach+0x5c/0x90 Nov 18 13:32:08 localhost kernel: [] bus_add_driver+0x91/0xb0 Nov 18 13:32:08 localhost kernel: [] driver_register+0x8c/0x90 Nov 18 13:32:08 localhost kernel: [] pci_register_driver+0x90/0xb0 Nov 18 13:32:08 localhost kernel: [] mthca_init+0xf/0x1a [ib_mthca] Nov 18 13:32:08 localhost kernel: [] sys_init_module+0x289/0x340 Nov 18 13:32:08 localhost kernel: [] sysenter_past_esp+0x52/0x71 From halr at voltaire.com Thu Nov 18 10:51:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 13:51:33 -0500 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <20041118185305.GQ27658@sventech.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> Message-ID: <1100803892.3280.13.camel@localhost.localdomain> On Thu, 2004-11-18 at 13:53, Johannes Erdfelt wrote: > I would suggest doing > > > > char name[sizeof "ib_mad123"]; > > You mean > > char name[sizeof "ib_mad123" + 1]; > > right? :) Right. Thanks. Applied. -- Hal From roland at topspin.com Thu Nov 18 10:57:09 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:57:09 -0800 Subject: [openib-general] Draft kernel RFC patches coming... Message-ID: <52llczkmt6.fsf@topspin.com> I'm about to send out a draft version of the kernel submission patches. 
I am using the same script I'll use to send the patches to
linux-kernel, so look for the thread starting

    [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review

All comments/corrections/criticisms, both about the code and the
introduction and patch descriptions I wrote, will be very much
appreciated.

Assuming things look OK, I'm still planning on sending this to
linux-kernel on Monday.

Thanks,
  Roland

From roland at topspin.com Thu Nov 18 10:57:33 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:57:33 -0800
Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review
Message-ID: <200411181057.Qj4goy9DRJmMAYGe@topspin.com>

I'm very happy to be able to post an initial version of InfiniBand
patches for review. Although this code should be far closer to kernel
coding standards than previous open source InfiniBand drivers, this
initial posting should be treated as a request for comments and not a
request for inclusion; our ultimate goal is to have these drivers
included in the mainline kernel, but we expect that fixes and
improvements will need to be made before the code is completely
acceptable.

These patches add a minimal but complete level of InfiniBand support,
including an IB midlayer, a low-level driver for Mellanox HCAs, an
IP-over-InfiniBand driver, and a mechanism for MADs (management
datagrams) to be passed to and from userspace. This means that these
patches are all that is required for the kernel to bring up and use an
IP-over-InfiniBand link.

The code has not been through extreme stress testing yet, but it has
been used successfully on i386, x86_64, ppc64, ia64 and sparc64
systems, including mixed 32/64 systems.

Feedback on both details of the code as well as the high-level
organization of the code will be very much appreciated. For example,
the current set of patches puts include files in
drivers/infiniband/include; would it be preferred to put include files
in include/linux/infiniband/, directly in include/linux, or perhaps in
include/infiniband?

We would also like to explore the best avenue for having these patches
merged. It may be desirable for the patches to spend some time in -mm
before moving into Linus's kernel; on the other hand, the patches make
only very minimal and safe changes outside of drivers/infiniband, so it
is quite reasonable to merge them directly into the mainline kernel.
Although 2.6.10 is now closed, 2.6.11 should be open by the time the
review process is complete.

We look forward to the community's comments and criticisms!

Thanks,
  Roland Dreier
  OpenIB Alliance
From roland at topspin.com Thu Nov 18 10:58:03 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:03 -0800
Subject: [openib-general] [PATCH][RFC/v1][1/12] Add core InfiniBand support
Message-ID: <200411181058.nZu5AGvCLwleEqeJ@topspin.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: Roland Dreier
Subject: [PATCH][RFC/v1][1/12] Add core InfiniBand support
Date: Thu, 18 Nov 2004 10:58:03 -0800
Size: 120267
URL:

From roland at topspin.com Thu Nov 18 10:58:10 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:10 -0800
Subject: [openib-general] [PATCH][RFC/v1][2/12] Hook up drivers/infiniband
In-Reply-To: <200411181058.nZu5AGvCLwleEqeJ@topspin.com>
Message-ID: <200411181058.K6SRbLv9kMx8dY1X@topspin.com>

Add the appropriate lines to drivers/Kconfig and drivers/Makefile so
that the kernel configuration and build systems know about
drivers/infiniband.
Signed-off-by: Roland Dreier

Index: linux-bk/drivers/Kconfig
===================================================================
--- linux-bk.orig/drivers/Kconfig	2004-11-17 19:52:35.000000000 -0800
+++ linux-bk/drivers/Kconfig	2004-11-18 10:51:38.887317830 -0800
@@ -54,4 +54,6 @@
 source "drivers/usb/Kconfig"
+source "drivers/infiniband/Kconfig"
+
 endmenu

Index: linux-bk/drivers/Makefile
===================================================================
--- linux-bk.orig/drivers/Makefile	2004-11-17 19:52:44.000000000 -0800
+++ linux-bk/drivers/Makefile	2004-11-18 10:51:38.887317830 -0800
@@ -59,4 +59,5 @@
 obj-$(CONFIG_EISA) += eisa/
 obj-$(CONFIG_CPU_FREQ) += cpufreq/
 obj-$(CONFIG_MMC) += mmc/
+obj-$(CONFIG_INFINIBAND) += infiniband/
 obj-y += firmware/

From roland at topspin.com  Thu Nov 18 10:58:15 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:15 -0800
Subject: [openib-general] [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support
Message-ID: <200411181058.BeTpz4xPzTYV7Nk7@topspin.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: Roland Dreier
Subject: [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support
Date: Thu, 18 Nov 2004 10:58:15 -0800
Size: 108305
URL:

From roland at topspin.com  Thu Nov 18 10:58:22 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:22 -0800
Subject: [openib-general] [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support
Message-ID: <200411181058.sHj94LsTlhUWv3cp@topspin.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: Roland Dreier
Subject: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support
Date: Thu, 18 Nov 2004 10:58:22 -0800
Size: 32660
URL:

From roland at topspin.com  Thu Nov 18 10:58:27 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 10:58:27 -0800
Subject: [openib-general] [PATCH][RFC/v1][5/12] Add Mellanox HCA low-level driver
In-Reply-To: <200411181058.sHj94LsTlhUWv3cp@topspin.com>
Message-ID: <200411181058.1FCya6SB4aFjW3YZ@topspin.com>

Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present.

(As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.)

Signed-off-by: Roland Dreier

Index: linux-bk/drivers/infiniband/Kconfig
===================================================================
--- linux-bk.orig/drivers/infiniband/Kconfig	2004-11-18 10:51:37.708491106 -0800
+++ linux-bk/drivers/infiniband/Kconfig	2004-11-18 10:51:40.509079447 -0800
@@ -8,4 +8,6 @@
 	  any protocols you wish to use as well as drivers for your
 	  InfiniBand hardware.
+source "drivers/infiniband/hw/Kconfig"
+
 endmenu

Index: linux-bk/drivers/infiniband/Makefile
===================================================================
--- linux-bk.orig/drivers/infiniband/Makefile	2004-11-18 10:51:37.740486403 -0800
+++ linux-bk/drivers/infiniband/Makefile	2004-11-18 10:51:40.483083269 -0800
@@ -1 +1 @@
-obj-$(CONFIG_INFINIBAND) += core/
+obj-$(CONFIG_INFINIBAND) += core/ hw/

Index: linux-bk/drivers/infiniband/hw/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/Kconfig	2004-11-18 10:51:40.535075626 -0800
@@ -0,0 +1 @@
+source "drivers/infiniband/hw/mthca/Kconfig"

Index: linux-bk/drivers/infiniband/hw/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/Makefile	2004-11-18 10:51:40.559072099 -0800
@@ -0,0 +1 @@
+obj-$(CONFIG_INFINIBAND_MTHCA) += mthca/

Index: linux-bk/drivers/infiniband/hw/mthca/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/Kconfig	2004-11-18 10:51:40.583068572 -0800
@@ -0,0 +1,26 @@
+config INFINIBAND_MTHCA
+	tristate "Mellanox HCA support"
+	depends on PCI && INFINIBAND
+	---help---
+	  This is a low-level driver for Mellanox InfiniHost host
+	  channel adapters (HCAs), including the MT23108 PCI-X HCA
+	  ("Tavor") and the MT25208 PCI Express HCA ("Arbel").
+
+config INFINIBAND_MTHCA_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_MTHCA
+	default n
+	---help---
+	  This option causes the mthca driver to produce a bunch of
+	  debug messages.  Select this if you are developing the
+	  driver or trying to diagnose a problem.
+
+config INFINIBAND_MTHCA_SSE_DOORBELL
+	bool "SSE doorbell code"
+	depends on INFINIBAND_MTHCA && X86 && !X86_64
+	default n
+	---help---
+	  This option will have the mthca driver use SSE instructions
+	  to ring hardware doorbell registers.  This may improve
+	  performance for some workloads, but the driver will not run
+	  on processors without SSE instructions.

Index: linux-bk/drivers/infiniband/hw/mthca/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/Makefile	2004-11-18 10:51:40.606065191 -0800
@@ -0,0 +1,23 @@
+EXTRA_CFLAGS += -Idrivers/infiniband/include
+
+ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
+EXTRA_CFLAGS += -DDEBUG
+endif
+
+obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o
+
+ib_mthca-objs := \
+	mthca_main.o \
+	mthca_cmd.o \
+	mthca_profile.o \
+	mthca_reset.o \
+	mthca_allocator.o \
+	mthca_eq.o \
+	mthca_pd.o \
+	mthca_cq.o \
+	mthca_mr.o \
+	mthca_qp.o \
+	mthca_av.o \
+	mthca_mcg.o \
+	mthca_mad.o \
+	mthca_provider.o

Index: linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c	2004-11-18 10:51:40.630061664 -0800
@@ -0,0 +1,175 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $
+ */
+
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/bitmap.h>
+
+#include "mthca_dev.h"
+
+/* Trivial bitmap-based allocator */
+u32 mthca_alloc(struct mthca_alloc *alloc)
+{
+	u32 obj;
+
+	spin_lock(&alloc->lock);
+	obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last);
+	if (obj >= alloc->max) {
+		alloc->top = (alloc->top + alloc->max) & alloc->mask;
+		obj = find_first_zero_bit(alloc->table, alloc->max);
+	}
+
+	if (obj < alloc->max) {
+		set_bit(obj, alloc->table);
+		obj |= alloc->top;
+	} else
+		obj = -1;
+
+	spin_unlock(&alloc->lock);
+
+	return obj;
+}
+
+void mthca_free(struct mthca_alloc *alloc, u32 obj)
+{
+	obj &= alloc->max - 1;
+	spin_lock(&alloc->lock);
+	clear_bit(obj, alloc->table);
+	alloc->last = min(alloc->last, obj);
+	alloc->top = (alloc->top + alloc->max) & alloc->mask;
+	spin_unlock(&alloc->lock);
+}
+
+int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask,
+		     u32 reserved)
+{
+	int i;
+
+	/* num must be a power of 2 */
+	if (num != 1 << (ffs(num) - 1))
+		return -EINVAL;
+
+	alloc->last = 0;
+	alloc->top = 0;
+	alloc->max = num;
+	alloc->mask = mask;
+	spin_lock_init(&alloc->lock);
+	alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long),
+			       GFP_KERNEL);
+	if (!alloc->table)
+		return -ENOMEM;
+
+	bitmap_zero(alloc->table, num);
+	for (i = 0; i < reserved; ++i)
+		set_bit(i, alloc->table);
+
+	return 0;
+}
+
+void mthca_alloc_cleanup(struct mthca_alloc *alloc)
+{
+	kfree(alloc->table);
+}
+
+/*
+ * Array of pointers with lazy allocation of leaf pages.  Callers of
+ * _get, _set and _clear methods must use a lock or otherwise
+ * serialize access to the array.
+ */
+
+void *mthca_array_get(struct mthca_array *array, int index)
+{
+	int p = (index * sizeof (void *)) >> PAGE_SHIFT;
+
+	if (array->page_list[p].page) {
+		int i = index & (PAGE_SIZE / sizeof (void *) - 1);
+		return array->page_list[p].page[i];
+	} else
+		return NULL;
+}
+
+int mthca_array_set(struct mthca_array *array, int index, void *value)
+{
+	int p = (index * sizeof (void *)) >> PAGE_SHIFT;
+
+	/* Allocate with GFP_ATOMIC because we'll be called with locks held.
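+	 * A GFP_KERNEL allocation may sleep, which is not allowed while
+	 * holding a spinlock; GFP_ATOMIC never sleeps, at the price of
+	 * being more likely to fail -- hence the -ENOMEM fallback below.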
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-18 10:51:40.653058284 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if 
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-18 10:51:40.677054757 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
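 * (The #if 0 block below shows what PRM-style timeouts would look
 * like: time classes A/B/C work out to the PRM's roughly 1 ms, 10 ms
 * and 100 ms, rounded up to whole jiffies plus one.)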
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
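 * For example, mthca_MGID_HASH() below uses this to retrieve the
 * 16-bit multicast GID hash that the MGID_HASH firmware command
 * returns as an immediate output parameter.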
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
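+		 * E.g. a 64 KB chunk at DMA address 0x12340000 gives
+		 * ffs(0x12340000 | 0x10000) - 1 = 16, so it is passed to
+		 * the firmware as one aligned 64 KB page (encoded as
+		 * lg - 12 = 4 in the low bits of the entry); anything
+		 * smaller than 4 KB (lg < 12) is rejected with -EINVAL.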
+		 */
+		lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1;
+		if (lg < 12) {
+			mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n",
+				   (unsigned long long) sg_dma_address(sglist + i),
+				   sg_dma_len(sglist + i));
+			err = -EINVAL;
+			goto out;
+		}
+		for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) {
+			*((__be64 *) (inbox + nent * 4 + 2)) =
+				cpu_to_be64((sg_dma_address(sglist + i) +
+					     (j << lg)) |
+					    (lg - 12));
+			ts += 1 << (lg - 10);
+			if (nent == PAGE_SIZE / 16) {
+				err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+						CMD_TIME_CLASS_B, status);
+				if (err || *status)
+					goto out;
+				nent = 0;
+			}
+		}
+	}
+
+	if (nent) {
+		err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+				CMD_TIME_CLASS_B, status);
+	}
+
+	mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts);
+
+out:
+	pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma);
+	return err;
+}
+
+int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status);
+}
+
+int mthca_RUN_FW(struct mthca_dev *dev, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status);
+}
+
+int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status)
+{
+	u32 *outbox;
+	dma_addr_t outdma;
+	int err = 0;
+	u8 lg;
+
+#define QUERY_FW_OUT_SIZE 0x100
+#define QUERY_FW_VER_OFFSET 0x00
+#define QUERY_FW_MAX_CMD_OFFSET 0x0f
+#define QUERY_FW_ERR_START_OFFSET 0x30
+#define QUERY_FW_ERR_SIZE_OFFSET 0x38
+
+#define QUERY_FW_START_OFFSET 0x20
+#define QUERY_FW_END_OFFSET 0x28
+
+#define QUERY_FW_SIZE_OFFSET 0x00
+#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20
+#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40
+#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48
+
+	outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma);
+	if (!outbox) {
+		return -ENOMEM;
+	}
+
+	err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW,
+			    CMD_TIME_CLASS_A, status);
+
+	if (err)
+		goto out;
+
+	MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET);
+	/*
+	 * FW subminor version is at more significant bits than minor
+	 * version, so swap here.
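+	 * E.g. raw 0x000300020001 (major 3, subminor 2, minor 1)
+	 * becomes 0x000300010002, so the %012llx debug print below
+	 * reads as major.minor.subminor.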
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
+	pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE);
+	return err;
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h	2004-11-18 10:51:40.700051376 -0800
@@ -0,0 +1,260 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $
+ */
+
+#ifndef MTHCA_CMD_H
+#define MTHCA_CMD_H
+
+#include
+
+#define MTHCA_CMD_MAILBOX_ALIGN 16UL
+#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1)
+
+enum {
+	/* Command completed successfully: */
+	MTHCA_CMD_STAT_OK = 0x00,
+	/* Internal error (such as a bus error) occurred while processing command: */
+	MTHCA_CMD_STAT_INTERNAL_ERR = 0x01,
+	/* Operation/command not supported or opcode modifier not supported: */
+	MTHCA_CMD_STAT_BAD_OP = 0x02,
+	/* Parameter not supported or parameter out of range: */
+	MTHCA_CMD_STAT_BAD_PARAM = 0x03,
+	/* System not enabled or bad system state: */
+	MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04,
+	/* Attempt to access reserved or unallocated resource: */
+	MTHCA_CMD_STAT_BAD_RESOURCE = 0x05,
+	/* Requested resource is currently executing a command, or is otherwise busy: */
+	MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06,
+	/* DDR memory error: */
+	MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07,
+	/* Required capability exceeds device limits: */
+	MTHCA_CMD_STAT_EXCEED_LIM = 0x08,
+	/* Resource is not in the appropriate state or ownership: */
+	MTHCA_CMD_STAT_BAD_RES_STATE = 0x09,
+	/* Index out of range: */
+	MTHCA_CMD_STAT_BAD_INDEX = 0x0a,
+	/* FW image corrupted: */
+	MTHCA_CMD_STAT_BAD_NVMEM = 0x0b,
+	/* Attempt to modify a QP/EE which is not in the presumed state: */
+	MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10,
+	/* Bad segment parameters (Address/Size): */
+	MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20,
+	/* Memory Region has Memory Windows bound to it: */
+	MTHCA_CMD_STAT_REG_BOUND = 0x21,
+	/* HCA local attached memory not present: */
+	MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22,
+	/* Bad management packet (silently discarded): */
+	MTHCA_CMD_STAT_BAD_PKT = 0x30,
+	/* More outstanding CQEs in CQ than new CQ size: */
+	MTHCA_CMD_STAT_BAD_SIZE = 0x40
+};
+
+enum {
+	MTHCA_TRANS_INVALID = 0,
+	MTHCA_TRANS_RST2INIT,
+	MTHCA_TRANS_INIT2INIT,
+	MTHCA_TRANS_INIT2RTR,
+	MTHCA_TRANS_RTR2RTS,
+	MTHCA_TRANS_RTS2RTS,
+	MTHCA_TRANS_SQERR2RTS,
+	MTHCA_TRANS_ANY2ERR,
+	MTHCA_TRANS_RTS2SQD,
+	MTHCA_TRANS_SQD2SQD,
+	MTHCA_TRANS_SQD2RTS,
+	MTHCA_TRANS_ANY2RST,
+};
+
+enum {
+	DEV_LIM_FLAG_SRQ = 1 << 6
+};
+
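+/*
+ * Every command wrapper declared below returns a kernel error code
+ * (could the command be posted and completed?) and fills in *status
+ * with one of the MTHCA_CMD_STAT_* values above (did the firmware
+ * accept it?).  Callers must check both; the usual pattern, sketched
+ * here for illustration only, is:
+ *
+ *	u8 status;
+ *	int err = mthca_SYS_EN(dev, &status);
+ *
+ *	if (err)
+ *		return err;
+ *	if (status)
+ *		return -EINVAL;
+ */
+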
+struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) x, MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-11-18 10:51:40.724047849 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-11-18 10:51:40.747044469 -0800 @@ -0,0 +1,821 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +} __attribute__((packed)); + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & 
+ get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
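+	 * All other fields of the work completion (byte count,
+	 * immediate data, address information) are left untouched and
+	 * must not be relied on when the status indicates an error.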
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
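+	 * "Updating" means rewriting this CQE in place as a flush
+	 * error pointing at the next WQE in the chain and charging the
+	 * doorbell count, so a single hardware CQE can report the
+	 * flushing of several software work requests.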
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
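+	/*
+	 * At this point the CQ buffer (either one direct allocation or
+	 * npages individually allocated pages) is registered as a
+	 * memory region, so the HCA can reach the whole queue through
+	 * cq->mr.ibmr.lkey regardless of how it is laid out in host
+	 * memory.
+	 */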
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-18 10:51:40.770041089 -0800 @@ -0,0 +1,386 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + 
u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-18 10:51:40.794037561 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. 
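+ * The atomicity matters because a doorbell is one logical 64-bit
+ * command: if two CPUs could interleave the 32-bit halves of
+ * different doorbells, the HCA might latch a mix of the two.  That is
+ * why the fallback at the bottom of this file wraps its pair of
+ * __raw_writel() calls in a spinlock.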
+ */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-18 10:51:40.818034034 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
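+	/*
+	 * SW2HW_EQ hands ownership of the EQ context over to the
+	 * hardware, so from here on the EQ is live; all that remains
+	 * is freeing the temporary buffers, initializing the consumer
+	 * index and requesting the first event notification.
+	 */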
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-18 10:51:40.841030654 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = pci_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
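+ * Because of this we can look up dev->sm_ah and post the send + * while holding sm_lock, without taking a reference on the AH.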
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-18 10:51:40.864027274 -0800 @@ -0,0 +1,889 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM 
returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 
0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar2: + pci_release_region(pdev, 2); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
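+ * Resetting first means every step below starts from a known, + * clean device state.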
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-18 10:51:40.888023746 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. 
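+ * ("Not found" is reported through *index, not the return value.)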
+ * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * If GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. + */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + 
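/* Link the new AMGM entry into the hash chain by pointing the + * previous entry's next_gid_index at the slot just filled in. */ +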
mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-18 10:51:40.917019484 -0800 @@ -0,0 +1,389 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = 
cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + 
mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-18 10:51:40.940016104 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-11-18 10:51:40.964012577 -0800 @@ -0,0 +1,222 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
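+ * MTHCA_RES_NUM is small, so a simple bubble sort is sufficient.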
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
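The sorting comment above relies on a property worth making explicit:
when every size is a power of two and resources are placed largest
first, each running offset is automatically aligned to the next
resource's size, so the layout loop above never needs padding. A worked
example with three hypothetical sizes:

        1 MB resource   -> start 0x000000
        256 KB resource -> start 0x100000  (a multiple of 256 KB)
        8 KB resource   -> start 0x140000  (a multiple of 8 KB)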
+ */
+        dev->limits.num_pds = MTHCA_NUM_PDS;
+
+        kfree(profile);
+        return 0;
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h	2004-11-18 10:51:40.989008903 -0800
@@ -0,0 +1,58 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $
+ */
+
+#ifndef MTHCA_PROFILE_H
+#define MTHCA_PROFILE_H
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+        MTHCA_RES_QP,
+        MTHCA_RES_EEC,
+        MTHCA_RES_SRQ,
+        MTHCA_RES_CQ,
+        MTHCA_RES_EQP,
+        MTHCA_RES_EEEC,
+        MTHCA_RES_EQ,
+        MTHCA_RES_RDB,
+        MTHCA_RES_MCG,
+        MTHCA_RES_MPT,
+        MTHCA_RES_MTT,
+        MTHCA_RES_UAR,
+        MTHCA_RES_UDAV,
+        MTHCA_RES_NUM
+};
+
+int mthca_make_profile(struct mthca_dev *mdev,
+                       struct mthca_dev_lim *dev_lim,
+                       struct mthca_init_hca_param *init_hca);
+
+#endif /* MTHCA_PROFILE_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c	2004-11-18 10:51:41.916872516 -0800
@@ -0,0 +1,629 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
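mthca_query_device and mthca_query_port above, and the pkey/gid queries
that follow, all repeat the same allocate/fill/execute boilerplate
around mthca_MAD_IFC. One possible factoring, sketched here with a
hypothetical helper name that is not in the patch:

        static int mthca_query_smp(struct mthca_dev *mdev, u8 port,
                                   u16 attr_id, u32 attr_mod,
                                   struct ib_mad *out_mad)
        {
                struct ib_mad *in_mad;
                u8 status;
                int err;

                in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL);
                if (!in_mad)
                        return -ENOMEM;

                /* Build a SubnGet() MAD for the requested attribute. */
                memset(in_mad, 0, sizeof *in_mad);
                in_mad->mad_hdr.base_version  = 1;
                in_mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_SUBN_LID_ROUTED;
                in_mad->mad_hdr.class_version = 1;
                in_mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
                in_mad->mad_hdr.attr_id       = cpu_to_be16(attr_id);
                in_mad->mad_hdr.attr_mod      = cpu_to_be32(attr_mod);

                err = mthca_MAD_IFC(mdev, 1, port, in_mad, out_mad, &status);
                if (!err && status)
                        err = -EINVAL;  /* MAD-level failure */

                kfree(in_mad);
                return err;
        }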
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
+        case TAVOR:        return sprintf(buf, "MT23108\n");
+        case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n");
+        case ARBEL_NATIVE: return sprintf(buf, "MT25208\n");
+        default:           return sprintf(buf, "unknown\n");
+        }
+}
+
+static CLASS_DEVICE_ATTR(hw_rev,   S_IRUGO, show_rev,    NULL);
+static CLASS_DEVICE_ATTR(fw_ver,   S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca,    NULL);
+
+static struct class_device_attribute *mthca_class_attributes[] = {
+        &class_device_attr_hw_rev,
+        &class_device_attr_fw_ver,
+        &class_device_attr_hca_type
+};
+
+int mthca_register_device(struct mthca_dev *dev)
+{
+        int ret;
+        int i;
+
+        strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX);
+        dev->ib_dev.node_type     = IB_NODE_CA;
+        dev->ib_dev.phys_port_cnt = dev->limits.num_ports;
+        dev->ib_dev.dma_device    = dev->pdev;
+        dev->ib_dev.class_dev.dev = &dev->pdev->dev;
+        dev->ib_dev.query_device  = mthca_query_device;
+        dev->ib_dev.query_port    = mthca_query_port;
+        dev->ib_dev.modify_port   = mthca_modify_port;
+        dev->ib_dev.query_pkey    = mthca_query_pkey;
+        dev->ib_dev.query_gid     = mthca_query_gid;
+        dev->ib_dev.alloc_pd      = mthca_alloc_pd;
+        dev->ib_dev.dealloc_pd    = mthca_dealloc_pd;
+        dev->ib_dev.create_ah     = mthca_ah_create;
+        dev->ib_dev.destroy_ah    = mthca_ah_destroy;
+        dev->ib_dev.create_qp     = mthca_create_qp;
+        dev->ib_dev.modify_qp     = mthca_modify_qp;
+        dev->ib_dev.destroy_qp    = mthca_destroy_qp;
+        dev->ib_dev.post_send     = mthca_post_send;
+        dev->ib_dev.post_recv     = mthca_post_receive;
+        dev->ib_dev.create_cq     = mthca_create_cq;
+        dev->ib_dev.destroy_cq    = mthca_destroy_cq;
+        dev->ib_dev.poll_cq       = mthca_poll_cq;
+        dev->ib_dev.req_notify_cq = mthca_req_notify_cq;
+        dev->ib_dev.get_dma_mr    = mthca_get_dma_mr;
+        dev->ib_dev.reg_phys_mr   = mthca_reg_phys_mr;
+        dev->ib_dev.dereg_mr      = mthca_dereg_mr;
+        dev->ib_dev.attach_mcast  = mthca_multicast_attach;
+        dev->ib_dev.detach_mcast  = mthca_multicast_detach;
+        dev->ib_dev.process_mad   = mthca_process_mad;
+
+        ret = ib_register_device(&dev->ib_dev);
+        if (ret)
+                return ret;
+
+        for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) {
+                ret = class_device_create_file(&dev->ib_dev.class_dev,
+                                               mthca_class_attributes[i]);
+                if (ret) {
+                        ib_unregister_device(&dev->ib_dev);
+                        return ret;
+                }
+        }
+
+        return 0;
+}
+
+void mthca_unregister_device(struct mthca_dev *dev)
+{
+        ib_unregister_device(&dev->ib_dev);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h	2004-11-18 10:51:41.940868988 -0800
@@ -0,0 +1,221 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $
+ */
+
+#ifndef MTHCA_PROVIDER_H
+#define MTHCA_PROVIDER_H
+
+#include <ib_verbs.h>
+#include <ib_pack.h>
+
+#define MTHCA_MPT_FLAG_ATOMIC        (1 << 14)
+#define MTHCA_MPT_FLAG_REMOTE_WRITE  (1 << 13)
+#define MTHCA_MPT_FLAG_REMOTE_READ   (1 << 12)
+#define MTHCA_MPT_FLAG_LOCAL_WRITE   (1 << 11)
+#define MTHCA_MPT_FLAG_LOCAL_READ    (1 << 10)
+
+struct mthca_buf_list {
+        void *buf;
+        DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+struct mthca_mr {
+        struct ib_mr ibmr;
+        int          order;
+        u32          first_seg;
+};
+
+struct mthca_pd {
+        struct ib_pd    ibpd;
+        u32             pd_num;
+        atomic_t        sqp_count;
+        struct mthca_mr ntmr;
+};
+
+struct mthca_eq {
+        struct mthca_dev      *dev;
+        int                    eqn;
+        u32                    ecr_mask;
+        u16                    msi_x_vector;
+        u16                    msi_x_entry;
+        int                    have_irq;
+        int                    nent;
+        int                    cons_index;
+        struct mthca_buf_list *page_list;
+        struct mthca_mr        mr;
+};
+
+struct mthca_av;
+
+struct mthca_ah {
+        struct ib_ah     ibah;
+        int              on_hca;
+        u32              key;
+        struct mthca_av *av;
+        dma_addr_t       avdma;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table. Each
+ * struct mthca_cq/qp also has its own lock. An individual qp lock
+ * may be taken inside of an individual cq lock. Both cqs attached to
+ * a qp may be locked, with the send cq locked first. No other
+ * nesting should be done.
+ *
+ * Each struct mthca_cq/qp also has an atomic_t ref count. The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference also is good for access through the consumer API, so
+ * modifying the CQ/QP etc doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed. Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-11-18 10:51:41.963865608 -0800 @@ -0,0 +1,1485 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_qp.c 1227 2004-11-13 22:31:53Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + 
[UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; 
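To make the transition table concrete: for an RC QP moving INIT->RTR,
the entries above say a caller's attr_mask must contain at least the
bits below, may add the listed optional ones, and (as checked in
mthca_modify_qp further down) must contain nothing else:

        /* required for RC INIT->RTR, straight from state_table */
        int mask = IB_QP_STATE              |
                   IB_QP_AV                 |
                   IB_QP_PATH_MTU           |
                   IB_QP_DEST_QPN           |
                   IB_QP_RQ_PSN             |
                   IB_QP_MAX_DEST_RD_ATOMIC |
                   IB_QP_MIN_RNR_TIMER;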
+ if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + 
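The shifts just above pack two fields into one word: bits [31:29] of
mtu_msgmax carry the path MTU enum and bits [28:24] carry log2 of the
maximum message size. Reading the two branches back:

        /* UD/MLX QPs: fixed MTU 2048, max message 2^11 = 2 KB */
        mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | (11 << 24));
        /* connected QPs: caller's MTU, max message 2^31 bytes */
        mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | (31 << 24));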
qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, 
"modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + 
+ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = pci_alloc_consistent(dev->pdev, sqp->header_buf_size, + &sqp->header_dma); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + pci_free_consistent(dev->pdev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + 
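The mqpn arithmetic in mthca_alloc_sqp above (qpn * 2 + sqp_start +
port - 1) interleaves the special QPs of the two ports; writing S for
dev->qp_table.sqp_start:

        QP0, port 1 -> S                QP0, port 2 -> S + 1
        QP1, port 1 -> S + 2            QP1, port 2 -> S + 3

which is exactly the four-QP window that is_sqp() tests and the two-QP
window that is_qp0() tests near the top of the file.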
spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + pci_free_consistent(dev->pdev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-11-18 10:51:41.988861934 -0800 @@ -0,0 +1,228 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. 
*/ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Thu Nov 18 10:58:28 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:28 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver Message-ID: <200411181058.E12z7tsbPyqcrIIc@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver Date: Thu, 18 Nov 2004 10:58:28 -0800 Size: 100897 URL: From roland at topspin.com Thu Nov 18 10:58:34 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:34 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][7/12] Add InfiniBand userspace MAD support Message-ID: <200411181058.kKqIomD9Cag8nD8w@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][7/12] Add InfiniBand userspace MAD support Date: Thu, 18 Nov 2004 10:58:34 -0800 Size: 23295 URL: From roland at topspin.com Thu Nov 18 10:58:40 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:40 -0800 Subject: [openib-general] [PATCH][RFC/v1][8/12] Document InfiniBand ioctl use In-Reply-To: <200411181058.kKqIomD9Cag8nD8w@topspin.com> Message-ID: <200411181058.GB144BQ0h6EKgdPS@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. 
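(For readers unfamiliar with ioctl-number.txt: each entry reserves an ioctl "magic" value so command numbers don't collide across subsystems. A sketch of how a 0x1b-based number is typically composed -- the macro names here are illustrative, not quoted from the ib_umad patch:

	#define IB_IOCTL_MAGIC	0x1b	/* the value reserved in the table below */

	/* _IOWR() packs the magic, a command index, and the argument size
	 * into a single ioctl number: */
	#define IB_USER_MAD_REGISTER_AGENT \
		_IOWR(IB_IOCTL_MAGIC, 1, struct ib_user_mad_reg_req)

Registering the magic in Documentation/ioctl-number.txt is what keeps the 0x1b command space from being reused elsewhere.)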
Signed-off-by: Roland Dreier Index: linux-bk/Documentation/ioctl-number.txt =================================================================== --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-17 19:52:39.000000000 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-18 10:51:44.604477463 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Thu Nov 18 10:58:45 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:45 -0800 Subject: [openib-general] [PATCH][RFC/v1][9/12] Add InfiniBand Documentation files In-Reply-To: <200411181058.GB144BQ0h6EKgdPS@topspin.com> Message-ID: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/infiniband/ipoib.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-18 10:51:44.838443072 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net//create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debufs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance. 
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html Index: linux-bk/Documentation/infiniband/sysfs.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-18 10:51:44.866438957 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" Index: linux-bk/Documentation/infiniband/user_mad.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-18 10:51:44.892435136 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). 
The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Thu Nov 18 10:58:50 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:50 -0800 Subject: [openib-general] [PATCH][RFC/v1][10/12] IPoIB IPv4 multicast In-Reply-To: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> Message-ID: <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> Add ip_ib_mc_map() to convert IPv$ multicast addresses to IPoIB hardware addresses. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier Index: linux-bk/include/net/ip.h =================================================================== --- linux-bk.orig/include/net/ip.h 2004-11-17 19:52:25.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-11-18 10:51:45.214387812 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. 
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif Index: linux-bk/net/ipv4/arp.c =================================================================== --- linux-bk.orig/net/ipv4/arp.c 2004-11-17 19:52:34.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-18 10:51:45.214387812 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Thu Nov 18 10:58:56 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:58:56 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> Message-ID: <200411181058.05X2BGHkQm9UQnbx@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-bk/include/net/if_inet6.h =================================================================== --- linux-bk.orig/include/net/if_inet6.h 2004-11-17 19:52:39.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-18 10:51:45.514343721 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-bk/net/ipv6/addrconf.c =================================================================== --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-17 19:52:35.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-18 10:51:45.515343574 -0800 @@ -1098,6 +1098,13 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ + if (dev->addr_len != 20) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; Index: linux-bk/net/ipv6/ndisc.c =================================================================== --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-17 19:52:19.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-18 10:51:45.516343427 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Thu Nov 18 10:59:01 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 10:59:01 -0800 Subject: [openib-general] [PATCH][RFC/v1][12/12] InfiniBand MAINTAINERS entry In-Reply-To: <200411181058.05X2BGHkQm9UQnbx@topspin.com> Message-ID: <200411181059.m3Tu2pba6tQT74nE@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier Index: linux-bk/MAINTAINERS =================================================================== --- linux-bk.orig/MAINTAINERS 2004-11-17 19:52:19.000000000 -0800 +++ linux-bk/MAINTAINERS 2004-11-18 10:51:45.861292723 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From roland at topspin.com Thu Nov 18 11:00:14 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:00:14 -0800 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <20041118185305.GQ27658@sventech.com> (Johannes Erdfelt's message of "Thu, 18 Nov 2004 10:53:05 -0800") References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> Message-ID: <52hdnnkmo1.fsf@topspin.com> Johannes> You mean Johannes> char name[sizeof "ib_mad123" + 1]; Johannes> right? :) No, actually sizeof a string includes the trailing nul. Try the following program: int main() { printf("%d\n", (int) sizeof "123"); } I bet it prints "4" :) - R. From roland at topspin.com Thu Nov 18 11:01:37 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:01:37 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support In-Reply-To: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> (Roland Dreier's message of "Thu, 18 Nov 2004 10:58:03 -0800") References: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> Message-ID: <52d5ybkmlq.fsf@topspin.com> Hmm... looks like our spamassassin is a little trigger happy :) - R. From roland at topspin.com Thu Nov 18 11:12:16 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:12:16 -0800 Subject: [openib-general] Re: mthca crash on startup In-Reply-To: <1100803744.3280.11.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 13:49:04 -0500") References: <1100803744.3280.11.camel@localhost.localdomain> Message-ID: <524qjnkm3z.fsf@topspin.com> > modprobe: page allocation failure. order:6, mode:0x20 > [] mthca_alloc_sqp+0x6c/0x420 [ib_mthca] It's not actually a crash. It's just failing to allocate 2048 * 72 bytes of bus-coherent memory (send queue depth time size of a UD header) while creating a special QP. The system should survive this, although of course MAD services won't work. 
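To make the arithmetic concrete (a sketch only; the two sizes are taken from this discussion, not quoted from the driver source):

	/* 2048 send WQEs * 72 bytes of UD header = 147456 bytes = 36 4K pages.
	 * pci_alloc_consistent() gets this from the page allocator, and
	 * get_order(147456) rounds 36 pages up to the next power of two,
	 * 64 pages -- hence the "order:6" in the log above, a fairly large
	 * physically contiguous block to ask for. */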
There are a few things that can be done: - There's no reason mthca needs to allocate all this memory in one physically contiguous chunk, although it makes the code simpler. If this issue persists, we can fix the special QP allocation code (everything else in mthca is pretty good about not requiring contiguous pages). - I seem to recall messages recently on lkml that recent kernels have VM problems that lead to page allocation failures. I think there are some VM tunables and some patches in -mm that are supposed to help. - Having "#define IB_MAD_QP_SEND_SIZE 2048" seems a bit excessive to me. It seems a much shallower send queue should be plenty, especially for QP0. Reducing this will reduce the amount of contiguous memory required, which should improve things. - Roland From peter at pantasys.com Thu Nov 18 11:13:09 2004 From: peter at pantasys.com (Peter Buckingham) Date: Thu, 18 Nov 2004 11:13:09 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <200411181058.05X2BGHkQm9UQnbx@topspin.com> References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> Message-ID: <419CF445.407@pantasys.com> Hi Roland, > Index: linux-bk/net/ipv6/addrconf.c > =================================================================== > --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-17 19:52:35.000000000 -0800 > +++ linux-bk/net/ipv6/addrconf.c 2004-11-18 10:51:45.515343574 -0800 > @@ -1098,6 +1098,13 @@ > memset(eui, 0, 7); > eui[7] = *(u8*)dev->dev_addr; > return 0; > + case ARPHRD_INFINIBAND: > + /* XXX: replace len with IPOIB_HW_ADDR_LEN later */ > + if (dev->addr_len != 20) why not make this change to IPOIB_HW_ADDR_LEN now? that's all i've got for now ;-) peter From johannes at erdfelt.com Thu Nov 18 11:18:31 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 18 Nov 2004 11:18:31 -0800 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52hdnnkmo1.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> <52hdnnkmo1.fsf@topspin.com> Message-ID: <20041118191831.GS27658@sventech.com> On Thu, Nov 18, 2004, Roland Dreier wrote: > Johannes> You mean > > Johannes> char name[sizeof "ib_mad123" + 1]; > > Johannes> right? :) > > No, actually sizeof a string includes the trailing nul. Try the > following program: > > int main() { printf("%d\n", (int) sizeof "123"); } > > I bet it prints "4" :) You know, I knew that at one point, but I guess I forgot it for some reason because you're absolutely right. I guess I was thinking strlen instead of sizeof. JE From roland at topspin.com Thu Nov 18 11:20:35 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:20:35 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <419CF445.407@pantasys.com> (Peter Buckingham's message of "Thu, 18 Nov 2004 11:13:09 -0800") References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> <419CF445.407@pantasys.com> Message-ID: <52vfc3j75o.fsf@topspin.com> Peter> why not make this change to IPOIB_HW_ADDR_LEN now? Not a bad idea (although I guess it should be IPOIB_ALEN to match the rest of the kernel). I wonder where to put the value though... is it really worth creating a for this value? - R. 
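(For concreteness, a minimal sketch of what such a header could contain; the file name and macro name below are hypothetical, since the thread has not settled on either:

	/* include/linux/if_infiniband.h -- illustrative only */
	#ifndef _LINUX_IF_INFINIBAND_H
	#define _LINUX_IF_INFINIBAND_H

	/* Hardware address length for ARPHRD_INFINIBAND devices; the
	 * IPoIB patches in this series set dev->addr_len to 20. */
	#define IPOIB_ALEN	20

	#endif /* _LINUX_IF_INFINIBAND_H */

addrconf.c and the IPoIB driver could then compare dev->addr_len against IPOIB_ALEN instead of a bare 20.)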
From halr at voltaire.com Thu Nov 18 11:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 14:22:35 -0500 Subject: [openib-general] [PATCH] mad: Add port number to MAD thread names In-Reply-To: <52hdnnkmo1.fsf@topspin.com> References: <1100797323.3277.19.camel@localhost.localdomain> <52zn1fkr4s.fsf@topspin.com> <20041118185305.GQ27658@sventech.com> <52hdnnkmo1.fsf@topspin.com> Message-ID: <1100805755.3280.15.camel@localhost.localdomain> On Thu, 2004-11-18 at 14:00, Roland Dreier wrote: > Johannes> You mean > > Johannes> char name[sizeof "ib_mad123" + 1]; > > Johannes> right? :) > > No, actually sizeof a string includes the trailing nul. Try the > following program: > > int main() { printf("%d\n", (int) sizeof "123"); } > > I bet it prints "4" :) OK. It's back to just sizeof r.t. sizeof + 1... -- Hal From tduffy at sun.com Thu Nov 18 11:27:51 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 18 Nov 2004 11:27:51 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <52vfc3j75o.fsf@topspin.com> References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> <419CF445.407@pantasys.com> <52vfc3j75o.fsf@topspin.com> Message-ID: <1100806071.21672.7.camel@duffman> On Thu, 2004-11-18 at 11:20 -0800, Roland Dreier wrote: > Peter> why not make this change to IPOIB_HW_ADDR_LEN now? > > Not a bad idea (although I guess it should be IPOIB_ALEN to match the > rest of the kernel). I wonder where to put the value though... is it > really worth creating a for this value? Yeah, I think since everything else has one, ipoib should as well. There are some pretty short files in there like if_cablemodem.h or if_strip.h. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Nov 18 11:35:58 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:35:58 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] IPoIB IPv6 support In-Reply-To: <1100806071.21672.7.camel@duffman> (Tom Duffy's message of "Thu, 18 Nov 2004 11:27:51 -0800") References: <200411181058.05X2BGHkQm9UQnbx@topspin.com> <419CF445.407@pantasys.com> <52vfc3j75o.fsf@topspin.com> <1100806071.21672.7.camel@duffman> Message-ID: <52r7mrj6g1.fsf@topspin.com> Tom> Yeah, I think since everything else has one, ipoib should as Tom> well. There are some pretty short files in there like Tom> if_cablemodem.h or if_strip.h. Good point. It's pretty hard to beat if_strip.h for brevity... OK, I'll update the patches. Thanks, Roland From halr at voltaire.com Thu Nov 18 11:42:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 14:42:00 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <200411181058.sHj94LsTlhUWv3cp@topspin.com> References: <200411181058.sHj94LsTlhUWv3cp@topspin.com> Message-ID: <1100806919.3280.17.camel@localhost.localdomain> Nit alert... On Thu, 2004-11-18 at 13:58, Roland Dreier wrote: > Content preview: Add support for sending queries to the SA (Subnet > Administrator). ^^^^^^^^^^^^^ Administration). 
From roland at topspin.com Thu Nov 18 11:49:50 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:49:50 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <1100806919.3280.17.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 14:42:00 -0500") References: <200411181058.sHj94LsTlhUWv3cp@topspin.com> <1100806919.3280.17.camel@localhost.localdomain> Message-ID: <52mzxfj5sx.fsf@topspin.com> Hal> Nit alert... Thanks, fixed in my patches. - R. From halr at voltaire.com Thu Nov 18 11:46:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 14:46:05 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200411181058.E12z7tsbPyqcrIIc@topspin.com> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> Message-ID: <1100807164.3280.20.camel@localhost.localdomain> On Thu, 2004-11-18 at 13:58, Roland Dreier wrote: > The ARP/ND implementation for this driver is not completely > straightforward, because InfiniBand requires an additional path lookup > be performed (through an IB-specific mechanism) after a remote > hardware address has been resolved. We are very open to suggestions > of a better way to handle this than the current implementation. Is it also worth pointing out about multicast vis a vis IB ? -- Hal From roland at topspin.com Thu Nov 18 11:50:53 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:50:53 -0800 Subject: [openib-general] Re: mthca crash on startup In-Reply-To: <524qjnkm3z.fsf@topspin.com> (Roland Dreier's message of "Thu, 18 Nov 2004 11:12:16 -0800") References: <1100803744.3280.11.camel@localhost.localdomain> <524qjnkm3z.fsf@topspin.com> Message-ID: <52is83j5r6.fsf@topspin.com> I committed this change, which should help a little (pci_alloc_consistent() is implicitly GFP_ATOMIC). - R. Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 1265) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -967,8 +967,8 @@ u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; - sqp->header_buf = pci_alloc_consistent(dev->pdev, sqp->header_buf_size, - &sqp->header_dma); + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); if (!sqp->header_buf) return -ENOMEM; From roland at topspin.com Thu Nov 18 11:55:33 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 11:55:33 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1100807164.3280.20.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 14:46:05 -0500") References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> Message-ID: <52ekirj5je.fsf@topspin.com> Hal> Is it also worth pointing out about multicast vis a vis IB ? I didn't think so, because having to join a multicast group with the SA is conceptually similar to having to program a multicast hash table for an ethernet NIC or something like that. So I don't think we have the sort of layering violation that we have for ARP -- the kernel already expects the driver to have to do something driver-specific to handle multicast. 
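(For readers less familiar with the netdev side: the driver-specific hook in question is the 2.6-era set_multicast_list callback on struct net_device. A sketch with hypothetical names, not code from the IPoIB patch:

	/* Called by the core when dev->mc_list or the relevant flags change. */
	static void mydrv_set_multicast_list(struct net_device *dev)
	{
		/* An ethernet driver walks dev->mc_list and reprograms its
		 * hash filter here; an IPoIB driver would instead schedule
		 * SA join/leave requests for the same list. */
	}

	dev->set_multicast_list = mydrv_set_multicast_list;

)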
On the other hand, if you have some suggested verbiage, I'm happy to include it. - R. From halr at voltaire.com Thu Nov 18 12:02:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 15:02:18 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <52ekirj5je.fsf@topspin.com> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> Message-ID: <1100808138.3280.30.camel@localhost.localdomain> On Thu, 2004-11-18 at 14:55, Roland Dreier wrote: > Hal> Is it also worth pointing out about multicast vis a vis IB ? > > I didn't think so, because having to join a multicast group with the > SA is conceptually similar to having to program a multicast hash table > for an ethernet NIC or something like that. So I don't think we have > the sort of layering violation that we have for ARP -- the kernel > already expects the driver to have to do something driver-specific to > handle multicast. Yes and no. While it is the appears the same in terms of the host (and Linux already handles part of the problem, the semantics are not as rich as they need to be for IB. I am referring to knowing whether to join as a send only member, non member, or full member. (I think at least the non member/full member distinction is important; I can live without send only members as this is a minor optimization IMO although it does match the idea of an IP multicast transmitter). > On the other hand, if you have some suggested verbiage, I'm happy to > include it. If you think the idea above is correct, I will craft some verbiage. -- Hal From roland at topspin.com Thu Nov 18 12:11:10 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 12:11:10 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1100808138.3280.30.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 15:02:18 -0500") References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> Message-ID: <523bz6kjdt.fsf@topspin.com> Hal> Yes and no. While it is the appears the same in terms of the Hal> host (and Linux already handles part of the problem, the Hal> semantics are not as rich as they need to be for IB. I am Hal> referring to knowing whether to join as a send only member, Hal> non member, or full member. (I think at least the non Hal> member/full member distinction is important; I can live Hal> without send only members as this is a minor optimization IMO Hal> although it does match the idea of an IP multicast Hal> transmitter). Hal> If you think the idea above is correct, I will craft some Hal> verbiage. Sounds good. - Roland From halr at voltaire.com Thu Nov 18 12:41:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 15:41:13 -0500 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <523bz6kjdt.fsf@topspin.com> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> Message-ID: <1100810473.3280.47.camel@localhost.localdomain> On Thu, 2004-11-18 at 15:11, Roland Dreier wrote: > Hal> Yes and no. 
While it is the appears the same in terms of the > Hal> host (and Linux already handles part of the problem, the > Hal> semantics are not as rich as they need to be for IB. I am > Hal> referring to knowing whether to join as a send only member, > Hal> non member, or full member. (I think at least the non > Hal> member/full member distinction is important; I can live > Hal> without send only members as this is a minor optimization IMO > Hal> although it does match the idea of an IP multicast > Hal> transmitter). > > Hal> If you think the idea above is correct, I will craft some > Hal> verbiage. > > Sounds good. Here's a first cut at some working on multicast: Although IB has a special join mode intended to support IP multicast routing (non member), as no means to identify different multicast styles has yet been determined, all joins are currently full member. We are looking for guidance in how to solve this. One more thing: Do we also want to say something about no SM right now ? Or is that putting a cosmic kick me sign on ? -- Hal From tduffy at sun.com Thu Nov 18 12:59:36 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 18 Nov 2004 12:59:36 -0800 Subject: [openib-general] [PATCH][RFC/v1][6/12] Add PoIB (IP-over-InfiniBand) driver In-Reply-To: <1100810473.3280.47.camel@localhost.localdomain> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> Message-ID: <1100811576.21672.31.camel@duffman> On Thu, 2004-11-18 at 15:41 -0500, Hal Rosenstock wrote: > One more thing: > Do we also want to say something about no SM right now ? Or is that > putting a cosmic kick me sign on ? Actually, I thought we agreed that this was necessary to have before we submit to lkml. At least for inclusion. How is somebody going to test the openib code without an SM. (And no, buying a topspin switch is not the answer :-P Nor is using Solaris, or the old gen1 stack) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Nov 18 13:06:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Nov 2004 16:06:26 -0500 Subject: [openib-general] [PATCH][RFC/v1][6/12] Add PoIB (IP-over-InfiniBand) driver In-Reply-To: <1100811576.21672.31.camel@duffman> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> <1100811576.21672.31.camel@duffman> Message-ID: <1100811986.3280.59.camel@localhost.localdomain> On Thu, 2004-11-18 at 15:59, Tom Duffy wrote: > Actually, I thought we agreed that this was necessary to have before we > submit to lkml. At least for inclusion. How is somebody going to test > the openib code without an SM. (And no, buying a topspin switch is not > the answer :-P Nor is using Solaris, or the old gen1 stack) Don't forget Voltaire too... Anyhow, we are within days of starting on this. There are 2 main portions of this: 1. Port to gen2 API 2. Fix build The other aspects can wait if necessary. How long before we need the first part ? Is there any expectation on how long code review would last ? 
Or would they also be running the code ? -- Hal From mshefty at ichips.intel.com Thu Nov 18 13:16:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 18 Nov 2004 13:16:00 -0800 Subject: [openib-general] [PATCH][RFC/v1][6/12] Add PoIB (IP-over-InfiniBand) driver In-Reply-To: <1100811986.3280.59.camel@localhost.localdomain> References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> <1100811576.21672.31.camel@duffman> <1100811986.3280.59.camel@localhost.localdomain> Message-ID: <419D1110.2080900@ichips.intel.com> Hal Rosenstock wrote: > Anyhow, we are within days of starting on this. > > There are 2 main portions of this: > 1. Port to gen2 API > 2. Fix build > > The other aspects can wait if necessary. > > How long before we need the first part ? Is there any expectation on how > long code review would last ? Or would they also be running the code ? I would use the first part today if I had it. I wouldn't worry too much about code review right away, since its a user-mode component that wouldn't be included in the kernel. - Sean From roland at topspin.com Thu Nov 18 13:25:44 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 18 Nov 2004 13:25:44 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][6/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1100810473.3280.47.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 18 Nov 2004 15:41:13 -0500") References: <200411181058.E12z7tsbPyqcrIIc@topspin.com> <1100807164.3280.20.camel@localhost.localdomain> <52ekirj5je.fsf@topspin.com> <1100808138.3280.30.camel@localhost.localdomain> <523bz6kjdt.fsf@topspin.com> <1100810473.3280.47.camel@localhost.localdomain> Message-ID: <52y8gyj1d3.fsf@topspin.com> Hal> Although IB has a special join mode intended to support IP Hal> multicast routing (non member), as no means to identify Hal> different multicast styles has yet been determined, all joins Hal> are currently full member. We are looking for guidance in how Hal> to solve this. OK, added to to my patches. Hal> One more thing: Do we also want to say something about no SM Hal> right now ? Or is that putting a cosmic kick me sign on ? As far as I know the SM is now purely a userspace issue -- we have enough kernel support to run the SM. It's probably worth mentioning that OpenSM porting still needs to be done (has anyone started?). - Roland From iod00d at hp.com Thu Nov 18 13:48:27 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Nov 2004 13:48:27 -0800 Subject: [openib-general] [PATCH][RFC/v1][10/12] IPoIB IPv4 multicast In-Reply-To: <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> References: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> Message-ID: <20041118214827.GA15892@esmail.cup.hp.com> On Thu, Nov 18, 2004 at 10:58:50AM -0800, Roland Dreier wrote: > Add ip_ib_mc_map() to convert IPv$ multicast addresses to IPoIB > hardware addresses. ... > + addr = ntohl(addr); ... > + buf[19] = addr & 0xff; > + addr >>= 8; > + buf[18] = addr & 0xff; > + addr >>= 8; > + buf[17] = addr & 0xff; > + addr >>= 8; > + buf[16] = addr & 0x0f; Can the same be done instead with the following? addr &= 0x0fffffff; ((unsigned int *)buf)[4] = cpu_to_be32(addr); Or are there possible alignment issues with buf? 
Maybe the following is also correct:

((unsigned int *)buf)[4] = addr & htonl(0x0fffffff);

anyway...just some micro-optimizations...probably really only matters on BE machines.

thanks,
grant

From roland at topspin.com Thu Nov 18 13:53:32 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 18 Nov 2004 13:53:32 -0800
Subject: [openib-general] [PATCH][RFC/v1][10/12] IPoIB IPv4 multicast
In-Reply-To: <20041118214827.GA15892@esmail.cup.hp.com> (Grant Grundler's message of "Thu, 18 Nov 2004 13:48:27 -0800")
References: <200411181058.Vg9HQwEgGRimClBJ@topspin.com> <200411181058.X9N1lq3F3k3Sfu3e@topspin.com> <20041118214827.GA15892@esmail.cup.hp.com>
Message-ID: <52llcyj02r.fsf@topspin.com>

Grant> Can the same be done instead with the following?

I think only your second proposal is correct (since addr is in network byte order). However, the existing ip_eth_mc_map() function in uses the "one byte at a time" method, so I thought we might as well follow existing practice.

- R.

From mlleinin at hpcn.ca.sandia.gov Thu Nov 18 14:25:46 2004
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Thu, 18 Nov 2004 14:25:46 -0800
Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support
In-Reply-To: <52d5ybkmlq.fsf@topspin.com>
References: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> <52d5ybkmlq.fsf@topspin.com>
Message-ID: <1100816746.32165.77.camel@trinity>

On Thu, 2004-11-18 at 11:01 -0800, Roland Dreier wrote:
> Hmm... looks like our spamassassin is a little trigger happy :)

Well since Roland is sending us all spam I can either boot him off the list or increase the spamassassin threshold. :) I decided to increase the threshold to 7.5 (all the IB patches got a spam score of 6.6) so future kernel patches shouldn't be listed as spam.

- Matt

From iod00d at hp.com Thu Nov 18 14:52:28 2004
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 18 Nov 2004 14:52:28 -0800
Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support
In-Reply-To: <1100816746.32165.77.camel@trinity>
References: <200411181058.nZu5AGvCLwleEqeJ@topspin.com> <52d5ybkmlq.fsf@topspin.com> <1100816746.32165.77.camel@trinity>
Message-ID: <20041118225228.GB15892@esmail.cup.hp.com>

On Thu, Nov 18, 2004 at 02:25:46PM -0800, Matt Leininger wrote:
> On Thu, 2004-11-18 at 11:01 -0800, Roland Dreier wrote:
> > Hmm... looks like our spamassassin is a little trigger happy :)
>
> Well since Roland is sending us all spam I can either boot him off the
> list or increase the spamassassin threshold. :) I decided to increase
> the threshold to 7.5 (all the IB patches got a spam score of 6.6) so
> future kernel patches shouldn't be listed as spam.

There's a 3rd choice: adjust the scoring of individual tests the mail triggered so the total score for those emails is < 5.

hth,
grant

From David.Brean at Sun.COM Thu Nov 18 15:56:27 2004
From: David.Brean at Sun.COM (David M. Brean)
Date: Thu, 18 Nov 2004 18:56:27 -0500
Subject: Bonding (was: Re: [openib-general] Re: More on IPoIB Multicast)
In-Reply-To: <1100796136.3277.9.camel@localhost.localdomain>
References: <1100020075.7342.1.camel@hpc-1> <52r7n37xz9.fsf@topspin.com> <1100796136.3277.9.camel@localhost.localdomain>
Message-ID: <419D36AB.3060304@sun.com>

Doesn't the failover mechanism used by the bonding driver move the link layer address from the failed NIC to the standby NIC? If so, the IPoIB link layer address contains a port GUID, so it may be a bit more complex than usual to port bonding over IPoIB.
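(For reference, the 20-byte link-layer address being discussed, reconstructed from the ip_ib_mc_map() and addrconf.c changes earlier in the thread; the struct and its field names are illustrative, not something the patches define:

	struct ipoib_hw_addr {
		u8 qpn[4];	/* reserved byte plus the 24-bit queue pair
				 * number (all-ones for multicast)          */
		u8 gid[16];	/* port GID; bytes 12..19 of the address are
				 * the port GUID that addrconf.c copies into
				 * the IPv6 EUI-64                          */
	};

Because the port GUID is baked into the address, moving that address to a standby port is indeed less straightforward than moving an ethernet MAC.)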
I also thought that the bonding driver would be extended to support ethernet link aggregation (802.3ad) as the load balancing/failover mechanism at some point (I can't find schedule information). This function, too, would not work over IPoIB. -David Hal Rosenstock wrote: > On Thu, 2004-11-18 at 13:41, Nitin Hande wrote: > >> / But before that would like to hear from people about > > />/ various approaches. > / > Some vendors have implemented this by combining multiple HCA ports and > failing over from one to the other. Bonding may provide striping (using > both ports concurrently). > I will need to read up on bonding to understand what it provides and > compare it to what can be done under the IPoIB driver. > > -- Hal > >On Tue, 2004-11-09 at 12:07, Roland Dreier wrote: > > >>multiport bonding/failover >>(although my feeling is that it would be better to extend the existing >>bonding driver rather than trying to put this in the IPoIB driver), .... >> >> > >I'm not clear what the tradeoffs / pros / cons of the two approaches >(use the bonding driver (above the IPoIB driver) or implement it inside >the IPoIB driver) would be. > > > From halr at voltaire.com Fri Nov 19 07:40:01 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 19 Nov 2004 10:40:01 -0500 Subject: [openib-general] OpenIB Thread Usage Message-ID: <1100878801.19061.5.camel@hpc-1> Hi Roland, I noticed that IPoIB uses a single thread whereas the MAD layer uses a thread (per port) per CPU. Could/should IPoIB be multithreaded and would that help performance on multiple processors ? Thanks. -- Hal From roland at topspin.com Fri Nov 19 08:28:33 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:28:33 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <1100878801.19061.5.camel@hpc-1> (Hal Rosenstock's message of "Fri, 19 Nov 2004 10:40:01 -0500") References: <1100878801.19061.5.camel@hpc-1> Message-ID: <524qjliz0u.fsf@topspin.com> Hal> Hi Roland, I noticed that IPoIB uses a single thread whereas Hal> the MAD layer uses a thread (per port) per CPU. Could/should Hal> IPoIB be multithreaded and would that help performance on Hal> multiple processors ? I doubt it. The IPoIB workqueue is used for non-data path stuff like starting multicast joins, which are very far from being CPU-bound. All the data path stuff is run from interrupt context. Of course that's a theoretical argument, and if someone actually measures that changing the create_singlethread_workqueue() to create_workqueue() improves performance, I would have no problem making the change. In fact I'm not sure that having some many MAD workqueue threads isn't overkill that wastes resources, especially on machines with a lot of CPUs. - Roland From roland at topspin.com Fri Nov 19 08:44:26 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:44:26 -0800 Subject: [openib-general] Updated patches coming Message-ID: <52zn1dhjpx.fsf@topspin.com> I'm posting new versions of all the patches. These should incorporate all suggestions from yesterday. Please post any new comments (and let me know if I've screwed up fixing your previous comments). I'll send these patches to linux-kernel (with networking patches cc'ed to netdev) on Monday morning, incorporating any last comments I get by Sunday. Of course I'll also cc openib-general so that everyone here can see the responses without having to sift through linux-kernel. 
Thanks, Roland From roland at topspin.com Fri Nov 19 08:47:46 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:47:46 -0800 Subject: [openib-general] [PATCH][RFC/v2][0/12] Initial submission of InfiniBand patches for review Message-ID: <20041119 847.L6YuhFRk6dxcC9sS@topspin.com> I'm very happy to be able to post an initial version of InfiniBand patches for review. Although this code should be far closer to kernel coding standards than previous open source InfiniBand drivers, this initial posting should be treated as a request for comments and not a request for inclusion; our ultimate goal is to have these drivers included in the mainline kernel, but we expect that fixes and improvements will need to be made before the code is completely acceptable. These patches add a minimal but complete level of InfiniBand support, including an IB midlayer, a low-level driver for Mellanox HCAs, an IP-over-InfiniBand driver, and a mechanism for MADs (management datagrams) to be passed to and from userspace. This means that these patches are all that is required for the kernel to bring up and use an IP-over-InfiniBand link. (The OpenSM subnet manager has not been ported to this kernel API yet, although this work is underway. This means that at the moment, a kernel with these patches cannot be used to bring up a fabric; however, the kernel side is complete.) The code has not been through extreme stress testing yet, but it has been used successfully on i386, x86_64, ppc64, ia64 and sparc64 systems, including mixed 32/64 systems. Feedback on both details of the code as well as the high-level organization of the code will be very much appreciated. For example, the current set of patches puts include files in drivers/infiniband/include; would it be preferred to put include files in include/linux/infiniband/, directly in include/linux, or perhaps in include/infiniband? We would also like to explore the best avenue for having these patches merged. It may be desirable for the patches to spend some time in -mm before moving into Linus's kernel; on the other hand, the patches make only very minimal and safe changes outside of drivers/infiniband, so it is quite reasonable to merge them directly into the mainline kernel. Although 2.6.10 is now closed, 2.6.11 will probably be open by the time the review process is complete. We look forward to the community's comments and criticisms! Thanks, Roland Dreier OpenIB Alliance From roland at topspin.com Fri Nov 19 08:47:52 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:47:52 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][1/12] Add core InfiniBand support Message-ID: <20041119 847.0UsrM0D745D1EXvV@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][1/12] Add core InfiniBand support Date: Fri, 19 Nov 2004 08:47:52 -0800 Size: 120284 URL: From roland at topspin.com Fri Nov 19 08:47:59 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:47:59 -0800 Subject: [openib-general] [PATCH][RFC/v2][2/12] Hook up drivers/infiniband In-Reply-To: <20041119 847.0UsrM0D745D1EXvV@topspin.com> Message-ID: <20041119 847.Alul4BnW1lXB9SBr@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband.
Signed-off-by: Roland Dreier Index: linux-bk/drivers/Kconfig =================================================================== --- linux-bk.orig/drivers/Kconfig 2004-11-19 08:34:39.892304998 -0800 +++ linux-bk/drivers/Kconfig 2004-11-19 08:36:00.427436899 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu Index: linux-bk/drivers/Makefile =================================================================== --- linux-bk.orig/drivers/Makefile 2004-11-19 08:35:05.292561917 -0800 +++ linux-bk/drivers/Makefile 2004-11-19 08:36:00.428436751 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Fri Nov 19 08:48:04 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:04 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][3/12] Add InfiniBand MAD (management datagram) support Message-ID: <20041119 848.Sx9CmcXJ37MTHJMY@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][3/12] Add InfiniBand MAD (management datagram) support Date: Fri, 19 Nov 2004 08:48:04 -0800 Size: 108322 URL: From roland at topspin.com Fri Nov 19 08:48:17 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:17 -0800 Subject: [openib-general] [PATCH][RFC/v2][5/12] Add Mellanox HCA low-level driver In-Reply-To: <20041119 848.bGZXOMXI6bjJEWQr@topspin.com> Message-ID: <20041119 848.kWwVxIYmeAt15lmS@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.) Signed-off-by: Roland Dreier Index: linux-bk/drivers/infiniband/Kconfig =================================================================== --- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-19 08:35:58.828672505 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-19 08:36:02.081193188 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware.
+source "drivers/infiniband/hw/Kconfig" + endmenu Index: linux-bk/drivers/infiniband/Makefile =================================================================== --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-19 08:35:58.864667201 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-19 08:36:02.056196872 -0800 @@ -1 +1 @@ -obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND) += core/ hw/ Index: linux-bk/drivers/infiniband/hw/Kconfig =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/Kconfig 2004-11-19 08:36:02.124186852 -0800 @@ -0,0 +1 @@ +source "drivers/infiniband/hw/mthca/Kconfig" Index: linux-bk/drivers/infiniband/hw/Makefile =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/Makefile 2004-11-19 08:36:02.158181842 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND_MTHCA) += mthca/ Index: linux-bk/drivers/infiniband/hw/mthca/Kconfig =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-11-19 08:36:02.184178011 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver produce a bunch of debug + messages. Select this is you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. Index: linux-bk/drivers/infiniband/hw/mthca/Makefile =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-11-19 08:36:02.224172118 -0800 @@ -0,0 +1,23 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-objs := \ + mthca_main.o \ + mthca_cmd.o \ + mthca_profile.o \ + mthca_reset.o \ + mthca_allocator.o \ + mthca_eq.o \ + mthca_pd.o \ + mthca_cq.o \ + mthca_mr.o \ + mthca_qp.o \ + mthca_av.o \ + mthca_mcg.o \ + mthca_mad.o \ + mthca_provider.o Index: linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-11-19 08:36:02.277164308 -0800 @@ -0,0 +1,175 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $ + */ + +#include <linux/errno.h> +#include <linux/slab.h> +#include <linux/bitmap.h> + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held.
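+ * (GFP_KERNEL allocations can sleep, which is not allowed while + * holding a spinlock.)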
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-19 08:36:02.312159151 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include <linux/init.h> + +#include <ib_verbs.h> +#include <ib_cache.h> + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-19 08:36:02.355152815 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include <linux/sched.h> +#include <linux/pci.h> +#include <linux/errno.h> +#include <asm/io.h> + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now.
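+ * In practice this means a wedged command can tie up its caller for + * up to a minute before we give up and return -EBUSY.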
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
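+ * (In this file, SYS_EN and MGID_HASH are the commands issued this + * way.)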
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + if (!inbox) + return -ENOMEM; + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size.
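+ * (For example, a 64 KB region at a 4 KB-aligned address must be + * passed as sixteen 4 KB chunks: ffs(addr | size) - 1 is the log2 of + * the largest power of two dividing both.)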
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (++nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more significant bits than minor + * version, so swap here.
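+ * (For example, raw 0x000300020001 (major 3, subminor 2, minor 1) + * becomes 0x000300010002, which prints as FW version 3.1.2.)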
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ?
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-11-19 08:36:02.381148984 -0800 @@ -0,0 +1,260 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocated resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to it: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + 
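+/* Device limits reported by the QUERY_DEV_LIM firmware command; mthca_QUERY_DEV_LIM() fills this in, and the rest of the driver sizes its tables from these values. */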
+struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) x, MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-11-19 08:36:02.406145301 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-11-19 08:36:02.451138670 -0800 @@ -0,0 +1,821 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +} __attribute__((packed)); + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & 
+ get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
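+	/* The CQ buffer, CQN and MR are now set up; build the CQ context and pass ownership of the CQ to the HCA with SW2HW_CQ. */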
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-19 08:36:02.478134692 -0800 @@ -0,0 +1,386 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + 
u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-19 08:36:02.515129240 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. 
+ */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-19 08:36:02.559122757 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE)) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
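+	/* Firmware now owns the EQ; the mailbox and the DMA address list were only needed during setup, so free them. */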
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-19 08:36:02.587118631 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = pci_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
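+ * + * A stricter implementation would hold a reference on the AH (or + * copy it) until the send completes; we can skip that here because, + * as noted above, the hardware is done with the AH once the work + * request has been posted.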
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-19 08:36:02.665107138 -0800 @@ -0,0 +1,889 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM 
returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 
0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar2: + pci_release_region(pdev, 2); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
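+ * + * (Hence the ordering below: mthca_tune_pci() and the first + * firmware commands run only after the reset has succeeded.)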
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map interrupt clear register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-19 08:36:02.691103307 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. 
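+ * + * Conceptually the table is an MGM hash table indexed by the MGID + * hash, plus an AMGM overflow area; colliding entries are chained + * from the hash entry through next_gid_index: + * + * MGM[hash] -> AMGM -> AMGM -> (end of chain)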
+ * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * If GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. + */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + 
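+ /* Link the new AMGM entry into the hash chain: point the previous + * entry's next_gid_index (just re-read above) at the entry we + * filled in, and write it back. */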
mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-19 08:36:02.735096824 -0800 @@ -0,0 +1,389 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = 
cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + 
mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-19 08:36:02.775090930 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-11-19 08:36:02.802086952 -0800 @@ -0,0 +1,222 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
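+ * For example, with resources of 16 MB, 4 MB and 1 MB, the
+ * descending layout is 16 MB @ 0, 4 MB @ 16 MB, 1 MB @ 20 MB:
+ * everything placed earlier is a larger power of 2, so each start
+ * offset is automatically a multiple of the resource's own size.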
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
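+ * (A PD is only a protection domain number handed out by the
+ * allocator set up in mthca_init_pd_table(), so it is capped at
+ * MTHCA_NUM_PDS rather than sized against DDR.)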
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-11-19 08:36:02.826083415 -0800 @@ -0,0 +1,58 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-11-19 08:36:02.865077669 -0800 @@ -0,0 +1,629 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
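+	/*
+	 * Same SubnGet header boilerplate as the NodeInfo and
+	 * PortInfo queries above; only attr_id and attr_mod differ.
+	 * Here attr_mod = index / 32 picks which 32-entry block of
+	 * the P_Key table comes back in out_mad->data.
+	 */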
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = dev->pdev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-11-19 08:36:02.912070743 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $
+ */
+
+#ifndef MTHCA_PROVIDER_H
+#define MTHCA_PROVIDER_H
+
+#include <ib_verbs.h>
+#include <ib_pack.h>
+
+#define MTHCA_MPT_FLAG_ATOMIC        (1 << 14)
+#define MTHCA_MPT_FLAG_REMOTE_WRITE  (1 << 13)
+#define MTHCA_MPT_FLAG_REMOTE_READ   (1 << 12)
+#define MTHCA_MPT_FLAG_LOCAL_WRITE   (1 << 11)
+#define MTHCA_MPT_FLAG_LOCAL_READ    (1 << 10)
+
+struct mthca_buf_list {
+	void *buf;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+struct mthca_mr {
+	struct ib_mr ibmr;
+	int order;
+	u32 first_seg;
+};
+
+struct mthca_pd {
+	struct ib_pd    ibpd;
+	u32             pd_num;
+	atomic_t        sqp_count;
+	struct mthca_mr ntmr;
+};
+
+struct mthca_eq {
+	struct mthca_dev      *dev;
+	int                    eqn;
+	u32                    ecr_mask;
+	u16                    msi_x_vector;
+	u16                    msi_x_entry;
+	int                    have_irq;
+	int                    nent;
+	int                    cons_index;
+	struct mthca_buf_list *page_list;
+	struct mthca_mr        mr;
+};
+
+struct mthca_av;
+
+struct mthca_ah {
+	struct ib_ah     ibah;
+	int              on_hca;
+	u32              key;
+	struct mthca_av *av;
+	dma_addr_t       avdma;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table.  Each
+ * struct mthca_cq/qp also has its own lock.  An individual qp lock
+ * may be taken inside of an individual cq lock.  Both cqs attached to
+ * a qp may be locked, with the send cq locked first.  No other
+ * nesting should be done.
+ *
+ * Each struct mthca_cq/qp also has an atomic_t ref count.  The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference also is good for access through the consumer API, so
+ * modifying the CQ/QP etc doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed.  Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
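+ *
+ * In code form, the completion-event rules above look like this
+ * (the same pattern mthca_qp_event() in mthca_qp.c follows):
+ *
+ *	spin_lock(&dev->qp_table.lock);
+ *	qp = mthca_array_get(&dev->qp_table.qp, qpn);
+ *	if (qp)
+ *		atomic_inc(&qp->refcount);
+ *	spin_unlock(&dev->qp_table.lock);
+ *
+ *	if (!qp)
+ *		return;			/* bogus QP number */
+ *
+ *	... dispatch the event ...
+ *
+ *	if (atomic_dec_and_test(&qp->refcount))
+ *		wake_up(&qp->wait);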
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-11-19 08:36:02.958063966 -0800 @@ -0,0 +1,1485 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_qp.c 1270 2004-11-18 21:47:31Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + 
[UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; 
+ if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + 
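+	/* mtu_msgmax packs the path MTU into bits 31:29 and log2 of
+	 * the maximum message size into bits 28:24: MLX and UD QPs
+	 * are capped at 2048-byte messages (2^11), while the other
+	 * transports advertise the full 2^31 bytes. */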
qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, 
"modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + 
+ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + 
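+	/*
+	 * The destroy recipe from the locking notes in
+	 * mthca_provider.h: clear the qp_table pointer, drop that
+	 * reference, then sleep until every event and poll path has
+	 * dropped its own reference too.
+	 */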
spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->imm = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ?
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-11-19 08:36:03.007056746 -0800 @@ -0,0 +1,228 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $ + */ + +#include <linux/config.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/pci.h> +#include <linux/delay.h> + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID.
*/ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Fri Nov 19 08:48:11 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:11 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][4/12] Add InfiniBand SA (Subnet Administration) query support Message-ID: <20041119 848.bGZXOMXI6bjJEWQr@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][4/12] Add InfiniBand SA (Subnet Administration) query support Date: Fri, 19 Nov 2004 08:48:11 -0800 Size: 32678 URL: From roland at topspin.com Fri Nov 19 08:48:17 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:17 -0800 Subject: [openib-general] [PATCH][RFC/v2][6/12] IPoIB IPv4 multicast In-Reply-To: <20041119 848.kWwVxIYmeAt15lmS@topspin.com> Message-ID: <20041119 848.hDNvGYK1INkrbzum@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add <linux/if_infiniband.h> so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier Index: linux-bk/include/linux/if_infiniband.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-11-19 08:36:05.004762348 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ Index: linux-bk/include/net/ip.h =================================================================== --- linux-bk.orig/include/net/ip.h 2004-11-19 08:34:13.893136297 -0800 +++ linux-bk/include/net/ip.h 2004-11-19 08:36:05.005762200 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif Index: linux-bk/net/ipv4/arp.c =================================================================== --- linux-bk.orig/net/ipv4/arp.c 2004-11-19 08:34:34.281131877 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-19 08:36:05.005762200 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Fri Nov 19 08:48:23 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:23 -0800 Subject: [openib-general] [PATCH][RFC/v2][7/12] IPoIB IPv6 support In-Reply-To: <20041119 848.hDNvGYK1INkrbzum@topspin.com> Message-ID: <20041119 848.XUxwEgSdAfHPje3T@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. 
The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-bk/include/net/if_inet6.h =================================================================== --- linux-bk.orig/include/net/if_inet6.h 2004-11-19 08:34:41.311095920 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-19 08:36:05.345712102 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-bk/net/ipv6/addrconf.c =================================================================== --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-19 08:34:39.555354652 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-19 08:36:05.347711808 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1098,6 +1099,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. */ return; Index: linux-bk/net/ipv6/ndisc.c =================================================================== --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-19 08:34:04.597506114 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-19 08:36:05.348711660 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Fri Nov 19 08:48:28 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:28 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][8/12] Add IPoIB (IP-over-InfiniBand) driver Message-ID: <20041119 848.bjjQhFQkoeJ2U43n@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v2][8/12] Add IPoIB (IP-over-InfiniBand) driver Date: Fri, 19 Nov 2004 08:48:28 -0800 Size: 101515 URL: From roland at topspin.com Fri Nov 19 08:48:35 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:35 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][9/12] Add InfiniBand userspace MAD support Message-ID: <20041119 848.SV7ZJDa1e8EOaqlB@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... 
From: Roland Dreier Subject: [PATCH][RFC/v2][9/12] Add InfiniBand userspace MAD support Date: Fri, 19 Nov 2004 08:48:35 -0800 Size: 23312 URL: From roland at topspin.com Fri Nov 19 08:48:40 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:40 -0800 Subject: [openib-general] [PATCH][RFC/v2][10/12] Document InfiniBand ioctl use In-Reply-To: <20041119 848.SV7ZJDa1e8EOaqlB@topspin.com> Message-ID: <20041119 848.HEJ0RHrfzfVVBRVp@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/ioctl-number.txt =================================================================== --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-19 08:34:40.240253723 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-19 08:36:07.257430376 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Fri Nov 19 08:48:45 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:45 -0800 Subject: [openib-general] [PATCH][RFC/v2][11/12] Add InfiniBand Documentation files In-Reply-To: <20041119 848.HEJ0RHrfzfVVBRVp@topspin.com> Message-ID: <20041119 848.wcw3ffhLsggHhXp4@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/infiniband/ipoib.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-19 08:36:07.579382931 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<intf name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on.
+ + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance. + +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html Index: linux-bk/Documentation/infiniband/sysfs.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-19 08:36:07.775354051 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" Index: linux-bk/Documentation/infiniband/user_mad.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-19 08:36:07.822347125 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... 
*/ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Fri Nov 19 08:48:51 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 08:48:51 -0800 Subject: [openib-general] [PATCH][RFC/v2][12/12] InfiniBand MAINTAINERS entry In-Reply-To: <20041119 848.wcw3ffhLsggHhXp4@topspin.com> Message-ID: <20041119 848.ZGXMNSS7d0T6XA9U@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier Index: linux-bk/MAINTAINERS =================================================================== --- linux-bk.orig/MAINTAINERS 2004-11-19 08:34:04.771480477 -0800 +++ linux-bk/MAINTAINERS 2004-11-19 08:36:08.142299974 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From roland at topspin.com Fri Nov 19 09:04:36 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 09:04:36 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v2][1/12] Add core InfiniBand support In-Reply-To: <20041119 847.0UsrM0D745D1EXvV@topspin.com> (Roland Dreier's message of "Fri, 19 Nov 2004 08:47:52 -0800") References: <20041119 847.0UsrM0D745D1EXvV@topspin.com> Message-ID: <52hdnlhisb.fsf@topspin.com> This being flagged as spam seems to be a bug in my patch-sending script -- I'm generating an invalid message ID: 1.8 INVALID_MSGID Message-Id is not valid, according to RFC 2822 I'll fix this up before Monday. - R. 
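The fragments quoted in user_mad.txt above can be assembled into one small program. The sketch below is illustrative only: it assumes the ib_user_mad.h definitions from the userspace MAD patch (struct ib_user_mad_reg_req, struct ib_user_mad and the IB_USER_MAD_REGISTER_AGENT ioctl) plus a device node created by the suggested udev rule; any detail not shown in user_mad.txt itself is an assumption here, not a documented interface.

	/*
	 * Sketch: register a MAD agent and wait for one incoming MAD.
	 * Header name and device path are assumptions based on the
	 * patches above.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include "ib_user_mad.h"	/* from the userspace MAD patch */

	int main(void)
	{
		struct ib_user_mad_reg_req req;
		struct ib_user_mad mad;
		int fd, ret;

		/* Port 1 of mthca0, as named by the suggested udev rule */
		fd = open("/dev/infiniband/mthca0/ports/1/mad", O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		memset(&req, 0, sizeof req);
		/* ... fill in the registration request for this agent ... */

		ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
		if (ret) {
			perror("agent register");
			return 1;
		}

		/* Block until a MAD arrives; poll()/select() also work */
		ret = read(fd, &mad, sizeof mad);
		if (ret != sizeof mad)
			perror("read");
		else
			printf("agent %u: MAD from LID 0x%04x, status %u\n",
			       req.id, mad.lid, mad.status);

		close(fd);
		return 0;
	}

Per user_mad.txt, closing the descriptor unregisters any remaining agents, so the explicit unregister ioctl is omitted from the sketch.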
From mshefty at ichips.intel.com Fri Nov 19 09:08:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 09:08:43 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <524qjliz0u.fsf@topspin.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> Message-ID: <419E289B.90408@ichips.intel.com> Roland Dreier wrote: > In fact I'm not sure that having so many MAD workqueue threads isn't > overkill that wastes resources, especially on machines with a lot of > CPUs. I tried to keep the MAD layer from knowing about completion threads to make it easier to change it later. I think once we get to some CM performance testing, we can try adjusting the threading model to give us the best performance and scalability: one per port, one per CPU shared across all ports, one per system, dynamically allocated threads, etc. Right now, I'm not sure what sort of performance hit we'll see by having additional idle threads when MAD traffic is low. - Sean From roland at topspin.com Fri Nov 19 09:20:24 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 09:20:24 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E289B.90408@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Nov 2004 09:08:43 -0800") References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> Message-ID: <52d5y9hi1z.fsf@topspin.com> Sean> I tried to keep the MAD layer from knowing about completion Sean> threads to make it easier to change it later. I think once Sean> we get to some CM performance testing, we can try adjusting Sean> the threading model to give us the best performance and Sean> scalability: one per port, one per CPU shared across all Sean> ports, one per system, dynamically allocated threads, etc. I think the CM ends up needing its own set of workqueues so that it can queue MAD processing along with "time wait" events etc. Also we don't want the CM to block general MAD processing while it waits for things like QP modify. Sean> Right now, I'm not sure what sort of performance hit we'll Sean> see by having additional idle threads when MAD traffic is Sean> low. idle threads have pretty minimal impact beyond the memory they use. However on say a 512 CPU box with 6 HCAs, we would create 6000+ kernel threads, which seems pretty excessive. - R. From mshefty at ichips.intel.com Fri Nov 19 09:36:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 09:36:27 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <52d5y9hi1z.fsf@topspin.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> Message-ID: <419E2F1B.7050804@ichips.intel.com> Roland Dreier wrote: > I think the CM ends up needing its own set of workqueues so that it > can queue MAD processing along with "time wait" events etc. Also we > don't want the CM to block general MAD processing while it waits for > things like QP modify. I thought about this approach, but wasn't sure about taking a context switch. I guess with QP redirection, this wouldn't be an issue though. > idle threads have pretty minimal impact beyond the memory they use. > However on say a 512 CPU box with 6 HCAs, we would create 6000+ kernel > threads, which seems pretty excessive. Wouldn't it still just be one per port, or 12 total?
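For context on the numbers in this exchange: create_workqueue() starts one worker thread per CPU for the queue, so 6 HCAs x 2 ports x 512 CPUs comes to 6,144 threads, while a single-threaded workqueue per port would come to 12. A schematic of the two choices, sketched against the 2.6 workqueue API of the period rather than the actual mad.c code (Hal posts the real one-line change later in this thread):

	#include <linux/kernel.h>
	#include <linux/errno.h>
	#include <linux/workqueue.h>

	/* Illustrative only: the two workqueue creation options under
	 * discussion for a given port's MAD completion handling. */
	static int example_create_mad_wq(int port_num, int one_thread)
	{
		char name[16];
		struct workqueue_struct *wq;

		snprintf(name, sizeof name, "ib_mad%d", port_num);

		if (one_thread)
			/* exactly one worker thread for this port */
			wq = create_singlethread_workqueue(name);
		else
			/* one worker thread per CPU for this port */
			wq = create_workqueue(name);

		return wq ? 0 : -ENOMEM;
	}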
From halr at voltaire.com Fri Nov 19 09:46:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 19 Nov 2004 12:46:38 -0500 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E2F1B.7050804@ichips.intel.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> Message-ID: <1100886398.3002.0.camel@hpc-1> On Fri, 2004-11-19 at 12:36, Sean Hefty wrote: > > idle threads have pretty minimal impact beyond the memory they use. > > However on say a 512 CPU box with 6 HCAs, we would create 6000+ kernel > > threads, which seems pretty excessive. > > Wouldn't it still just be one per port, or 12 total? It's currently 1/port/CPU. I think this can be changed easily. -- Hal From roland at topspin.com Fri Nov 19 09:51:33 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 19 Nov 2004 09:51:33 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E2F1B.7050804@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Nov 2004 09:36:27 -0800") References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> Message-ID: <528y8xhgm2.fsf@topspin.com> Sean> I thought about this approach, but wasn't sure about taking Sean> a context switch. I guess with QP redirection, this Sean> wouldn't be an issue though. I don't think there's a choice. If the CM processes MADs from one queue and time wait expirations from another, it's not possible to prevent the MAD queue from getting arbitrarily far ahead of the time wait queue. This results in QPs never being reaped and eventually the system runs out of memory. - R. From mshefty at ichips.intel.com Fri Nov 19 11:59:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 11:59:11 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <528y8xhgm2.fsf@topspin.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> <528y8xhgm2.fsf@topspin.com> Message-ID: <419E508F.5040603@ichips.intel.com> Roland Dreier wrote: > Sean> I thought about this approach, but wasn't sure about taking > Sean> a context switch. I guess with QP redirection, this > Sean> wouldn't be an issue though. > > I don't think there's a choice. If the CM processes MADs from one queue > and time wait expirations from another, it's not possible to prevent > the MAD queue from getting arbitrarily far ahead of the time wait > queue. This results in QPs never being reaped and eventually the > system runs out of memory. I'm not understanding the issue here. If connections are being made faster than QPs are leaving the time wait state, then the system will eventually run out of resources. But this problem seems somewhat separate from the threading model used to establish connections, unless that thread is preventing other threads from executing. If that's the case, is it worth considering exposing the MAD work queue for use by not just the MAD layer, but also specific clients, such as the CM and SA client code? I would think that as long as the code is structured around using work queues, we should be able to adjust the number of threads per work queue, along with the number of work queues in order to see what combination works best.
- Sean From halr at voltaire.com Fri Nov 19 12:18:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 19 Nov 2004 15:18:45 -0500 Subject: [openib-general] [RFC] [PATCH] mad: Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU Message-ID: <1100895525.4136.11.camel@localhost.localdomain> Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU (Note that I have not applied this but am requesting comments). Index: mad.c =================================================================== --- mad.c (revision 1269) +++ mad.c (working copy) @@ -1900,7 +1900,7 @@ goto error7; snprintf(name, sizeof name, "ib_mad%d", port_num); - port_priv->wq = create_workqueue(name); + port_priv->wq = create_singlethread_workqueue(name); if (!port_priv->wq) { ret = -ENOMEM; goto error8; From mshefty at ichips.intel.com Fri Nov 19 13:25:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Nov 2004 13:25:27 -0800 Subject: [openib-general] [RFC] [PATCH] mad: Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU In-Reply-To: <1100895525.4136.11.camel@localhost.localdomain> References: <1100895525.4136.11.camel@localhost.localdomain> Message-ID: <419E64C7.9030905@ichips.intel.com> Hal Rosenstock wrote: > Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU > (Note that I have not applied this but am requesting comments). > > Index: mad.c > =================================================================== > --- mad.c (revision 1269) > +++ mad.c (working copy) > @@ -1900,7 +1900,7 @@ > goto error7; > > snprintf(name, sizeof name, "ib_mad%d", port_num); > - port_priv->wq = create_workqueue(name); > + port_priv->wq = create_singlethread_workqueue(name); > if (!port_priv->wq) { > ret = -ENOMEM; > goto error8; My guess is that this is probably preferable to having 1/port/CPU, especially on larger systems. It would depend on what the clients do when notified of a completion. I guess one advantage of keeping it 1/port/CPU (for now) is that it would help test multi-threaded support. - Sean From halr at voltaire.com Sun Nov 21 09:17:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 21 Nov 2004 12:17:57 -0500 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <52wtwmk9q8.fsf@topspin.com> References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> Message-ID: <1101057477.4124.7.camel@localhost.localdomain> On Mon, 2004-11-15 at 17:50, Roland Dreier wrote: > Tom> I just tried with the latest gen2 openib bits on 2.6.10-rc2, > Tom> mthca and ipoib builtin and everything builds and boots fine > Tom> (at least on x86_64). > > Cool, thanks for testing. For what it's worth, it works here on i386 > as well. (Not very convenient for development though :) Thanks for investigating. Here's the combo which breaks the build as follows: LD drivers/infiniband/core/built-in.o LD drivers/infiniband/built-in.o ld: cannot open drivers/infiniband/ulp/built-in.o: No such file or directory make[2]: *** [drivers/infiniband/built-in.o] Error 1 This occurs when you configure Infiniband as built-in and mthca and IPoIB as modular. This sort of combo seems to work for other subsystems like I2O. -- Hal From mst at mellanox.co.il Sun Nov 21 14:06:07 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 22 Nov 2004 00:06:07 +0200 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E508F.5040603@ichips.intel.com> References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> <528y8xhgm2.fsf@topspin.com> <419E508F.5040603@ichips.intel.com> Message-ID: <20041121220607.GF11676@mellanox.co.il> Hello! Quoting r. Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] Re: OpenIB Thread Usage": > Roland Dreier wrote: > > > Sean> I thought about this approach, but wasn't sure about taking > > Sean> a context switch. I guess with QP redirection, this > > Sean> wouldn't be an issue though. > > > >I don't think there's a choice. If the CM processes MADs from one queue > >and time wait expirations from another, it's not possible to prevent > >the MAD queue from getting arbitrarily far ahead of the time wait > >queue. This results in QPs never being reaped and eventually the > >system runs out of memory. > > I'm not understanding the issue here. If connections are being made > faster than QPs are leaving the time wait state, then the system will > eventually run out of resources. But this problem seems somewhat > separate from the threading model used to establish connections, unless > that thread is preventing other threads from executing. The idea I think was that if you start dropping MADs when you can not establish a connection, the remote side will retry. One way to drop MADs is by blocking the MAD work thread. MST From roland at topspin.com Sun Nov 21 20:41:31 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 21 Nov 2004 20:41:31 -0800 Subject: [openib-general] Re: OpenIB BuiltIn Support ? In-Reply-To: <1101057477.4124.7.camel@localhost.localdomain> (Hal Rosenstock's message of "Sun, 21 Nov 2004 12:17:57 -0500") References: <1100553459.2767.80.camel@hpc-1> <52actilqml.fsf@topspin.com> <419929AA.1010409@pantasys.com> <1100557988.13150.28.camel@duffman> <52wtwmk9q8.fsf@topspin.com> <1101057477.4124.7.camel@localhost.localdomain> Message-ID: <52d5y6fqbo.fsf@topspin.com> Hal> Thanks for investigating. Here's the combo which breaks the Hal> build as follows: Hal> This occurs when you configure Infiniband as built-in and Hal> mthca and IPoIB as modular. Hmm, seems like the kernel build system doesn't like going into directories and finding nothing to build. I fixed it by getting rid of ulp/Makefile, ulp/Kconfig, hw/Makefile and hw/Kconfig (and having infiniband/Kconfig and infiniband/Makefile go directly to ulp/ipoib and hw/mthca). With these changes, I built but didn't boot a kernel with IB-core built-in and mthca/ipoib as modules. - R. From roland at topspin.com Sun Nov 21 20:45:45 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 21 Nov 2004 20:45:45 -0800 Subject: [openib-general] Re: OpenIB Thread Usage In-Reply-To: <419E508F.5040603@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Nov 2004 11:59:11 -0800") References: <1100878801.19061.5.camel@hpc-1> <524qjliz0u.fsf@topspin.com> <419E289B.90408@ichips.intel.com> <52d5y9hi1z.fsf@topspin.com> <419E2F1B.7050804@ichips.intel.com> <528y8xhgm2.fsf@topspin.com> <419E508F.5040603@ichips.intel.com> Message-ID: <524qjifq4m.fsf@topspin.com> Sean> I'm not understanding the issue here. If connections are Sean> being made faster than QPs are leaving the time wait state, Sean> then the system will eventually run out of resources.
But Sean> this problem seems somewhat separate from the threading Sean> model used to establish connections, unless that thread is Sean> preventing other threads from executing. Sorry I wasn't clearer. The problem I was trying to describe (which incidentally has been seen with a real application) is that if MADs like CM REQs are processed in one queue, and time wait expirations are processed in a separate queue, then it's possible for the MAD queue + application to starve the time wait queue. This means a larger and larger backlog of time wait expirations accumulates and eventually the system runs out of resources, even though the application only keeps a constant number of QPs in use. Sean> If that's the case, is it worth considering exposing the Sean> MAD work queue for use by not just the MAD layer, but also Sean> specific clients, such as the CM and SA client code? That would be another solution. However it seems reasonable to let clients tune their work processing model to their needs (and avoid having them clog up the MAD queue). - R. From roland at topspin.com Mon Nov 22 07:13:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:24 -0800 Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review Message-ID: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> I'm very happy to be able to post an initial version of InfiniBand patches for review. Although this code should be far closer to kernel coding standards than previous open source InfiniBand drivers, this initial posting should be treated as a request for comments and not a request for inclusion; our ultimate goal is to have these drivers included in the mainline kernel, but we expect that fixes and improvements will need to be made before the code is completely acceptable. These patches add a minimal but complete level of InfiniBand support, including an IB midlayer, a low-level driver for Mellanox HCAs, an IP-over-InfiniBand driver, and a mechanism for MADs (management datagrams) to be passed to and from userspace. This means that these patches are all that is required for the kernel to bring up and use an IP-over-InfiniBand link. (The OpenSM subnet manager has not been ported to this kernel API yet, although this work is underway. This means that at the moment, a kernel with these patches cannot be used to bring up a fabric; however, the kernel side is complete) The code has not been through extreme stress testing yet, but it has been used successfully on i386, x86_64, ppc64, ia64 and sparc64 systems, including mixed 32/64 systems. Feedback on both details of the code as well as the high-level organization of the code will be very much appreciated. For example, the current set of patches puts include files in drivers/infiniband/include; would it be preferred to put include files in include/linux/infiniband/, directly in include/linux, or perhaps in include/infiniband? We would also like to explore the best avenue for having these patches merged. It may be desirable for the patches to spend some time in -mm before moving into Linus's kernel; on the other hand, the patches make only very minimal and safe changes outside of drivers/infiniband, so it is quite reasonable to merge them directly into the mainline kernel. Although 2.6.10 is now closed, 2.6.11 will probably be open by the time the review process is complete. We look forward to the community's comments and criticisms!
Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Mon Nov 22 07:13:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:29 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support Message-ID: <20041122713.TMt4584EVSreQOO2@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][1/12] Add core InfiniBand support Date: Mon, 22 Nov 2004 07:13:29 -0800 Size: 120269 URL: From roland at topspin.com Mon Nov 22 07:13:36 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:36 -0800 Subject: [openib-general] [PATCH][RFC/v1][2/12] Hook up drivers/infiniband In-Reply-To: <20041122713.TMt4584EVSreQOO2@topspin.com> Message-ID: <20041122713.yCm1WiU1XOAxLOWd@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. Signed-off-by: Roland Dreier Index: linux-bk/drivers/Kconfig =================================================================== --- linux-bk.orig/drivers/Kconfig 2004-11-21 21:07:30.646934807 -0800 +++ linux-bk/drivers/Kconfig 2004-11-21 21:25:52.850360262 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu Index: linux-bk/drivers/Makefile =================================================================== --- linux-bk.orig/drivers/Makefile 2004-11-21 21:07:54.491393897 -0800 +++ linux-bk/drivers/Makefile 2004-11-21 21:25:52.850360262 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Mon Nov 22 07:13:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:41 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support Message-ID: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][3/12] Add InfiniBand MAD (management datagram) support Date: Mon, 22 Nov 2004 07:13:41 -0800 Size: 108306 URL: From roland at topspin.com Mon Nov 22 07:13:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:48 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support Message-ID: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support Date: Mon, 22 Nov 2004 07:13:48 -0800 Size: 32662 URL: From roland at topspin.com Mon Nov 22 07:13:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:54 -0800 Subject: [openib-general] [PATCH][RFC/v1][5/12] Add Mellanox HCA low-level driver In-Reply-To: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> Message-ID: <20041122713.cSeT4UFKGqJDdZ8T@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. 
The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API) Signed-off-by: Roland Dreier Index: linux-bk/drivers/infiniband/Kconfig =================================================================== --- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-21 21:25:51.525556772 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-21 21:25:54.389132014 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu Index: linux-bk/drivers/infiniband/Makefile =================================================================== --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-21 21:25:51.549553213 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-21 21:25:54.364135721 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ Index: linux-bk/drivers/infiniband/hw/mthca/Kconfig =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-11-21 21:25:54.414128306 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. Index: linux-bk/drivers/infiniband/hw/mthca/Makefile =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-11-21 21:25:54.439124598 -0800 @@ -0,0 +1,23 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-objs := \ + mthca_main.o \ + mthca_cmd.o \ + mthca_profile.o \ + mthca_reset.o \ + mthca_allocator.o \ + mthca_eq.o \ + mthca_pd.o \ + mthca_cq.o \ + mthca_mr.o \ + mthca_qp.o \ + mthca_av.o \ + mthca_mcg.o \ + mthca_mad.o \ + mthca_provider.o Index: linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-11-21 21:25:54.464120890 -0800 @@ -0,0 +1,175 @@ +/* + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
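+ * (GFP_ATOMIC never sleeps, so this allocation can fail under memory
+ * pressure -- hence the -ENOMEM check just below.)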
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-21 21:25:54.489117183 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if 
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-21 21:25:54.517113030 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
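+ * (For reference, with HZ=1000 the disabled PRM-style values below
+ * would come to 2, 11 and 101 jiffies for classes A, B and C --
+ * far too strict if the FW really is starved.)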
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
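+ * In this patch, mthca_SYS_EN() and mthca_MGID_HASH() are the
+ * callers that use this immediate-result path.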
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
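+ * E.g. a 64 KB chunk DMA-mapped at address 0x12340000 gives
+ * lg = ffs(0x12340000 | 0x10000) - 1 = 16, so it is passed to FW
+ * as a single 2^16-byte page, encoded as 0x12340000 | (16 - 12).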
+ */
+ lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1;
+ if (lg < 12) {
+ mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n",
+ (unsigned long long) sg_dma_address(sglist + i),
+ sg_dma_len(sglist + i));
+ err = -EINVAL;
+ goto out;
+ }
+ for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) {
+ *((__be64 *) (inbox + nent * 4 + 2)) =
+ cpu_to_be64((sg_dma_address(sglist + i) +
+ (j << lg)) |
+ (lg - 12));
+ ts += 1 << (lg - 10);
+ if (nent == PAGE_SIZE / 16) {
+ err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+ CMD_TIME_CLASS_B, status);
+ if (err || *status)
+ goto out;
+ nent = 0;
+ }
+ }
+
+ if (nent) {
+ err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA,
+ CMD_TIME_CLASS_B, status);
+ }
+
+ mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts);
+
+out:
+ pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma);
+ return err;
+}
+
+int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status)
+{
+ return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status);
+}
+
+int mthca_RUN_FW(struct mthca_dev *dev, u8 *status)
+{
+ return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status);
+}
+
+int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status)
+{
+ u32 *outbox;
+ dma_addr_t outdma;
+ int err = 0;
+ u8 lg;
+
+#define QUERY_FW_OUT_SIZE 0x100
+#define QUERY_FW_VER_OFFSET 0x00
+#define QUERY_FW_MAX_CMD_OFFSET 0x0f
+#define QUERY_FW_ERR_START_OFFSET 0x30
+#define QUERY_FW_ERR_SIZE_OFFSET 0x38
+
+#define QUERY_FW_START_OFFSET 0x20
+#define QUERY_FW_END_OFFSET 0x28
+
+#define QUERY_FW_SIZE_OFFSET 0x00
+#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20
+#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40
+#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48
+
+ outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma);
+ if (!outbox) {
+ return -ENOMEM;
+ }
+
+ err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW,
+ CMD_TIME_CLASS_A, status);
+
+ if (err)
+ goto out;
+
+ MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET);
+ /*
+ * FW subminor version is at more significant bits than minor
+ * version, so swap here.
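+ * E.g. raw 0x000300020001 (major 3, subminor 2, minor 1)
+ * becomes 0x000300010002, i.e. FW version 3.1.2.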
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE);
+ return err;
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-11-21 21:25:54.543109174 -0800
@@ -0,0 +1,260 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * , or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software. These details are also available at
+ * .
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $
+ */
+
+#ifndef MTHCA_CMD_H
+#define MTHCA_CMD_H
+
+#include
+
+#define MTHCA_CMD_MAILBOX_ALIGN 16UL
+#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1)
+
+enum {
+ /* command completed successfully: */
+ MTHCA_CMD_STAT_OK = 0x00,
+ /* Internal error (such as a bus error) occurred while processing command: */
+ MTHCA_CMD_STAT_INTERNAL_ERR = 0x01,
+ /* Operation/command not supported or opcode modifier not supported: */
+ MTHCA_CMD_STAT_BAD_OP = 0x02,
+ /* Parameter not supported or parameter out of range: */
+ MTHCA_CMD_STAT_BAD_PARAM = 0x03,
+ /* System not enabled or bad system state: */
+ MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04,
+ /* Attempt to access reserved or unallocated resource: */
+ MTHCA_CMD_STAT_BAD_RESOURCE = 0x05,
+ /* Requested resource is currently executing a command, or is otherwise busy: */
+ MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06,
+ /* memory error: */
+ MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07,
+ /* Required capability exceeds device limits: */
+ MTHCA_CMD_STAT_EXCEED_LIM = 0x08,
+ /* Resource is not in the appropriate state or ownership: */
+ MTHCA_CMD_STAT_BAD_RES_STATE = 0x09,
+ /* Index out of range: */
+ MTHCA_CMD_STAT_BAD_INDEX = 0x0a,
+ /* FW image corrupted: */
+ MTHCA_CMD_STAT_BAD_NVMEM = 0x0b,
+ /* Attempt to modify a QP/EE which is not in the presumed state: */
+ MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10,
+ /* Bad segment parameters (Address/Size): */
+ MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20,
+ /* Memory Region has Memory Windows bound to it: */
+ MTHCA_CMD_STAT_REG_BOUND = 0x21,
+ /* HCA local attached memory not present: */
+ MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22,
+ /* Bad management packet (silently discarded): */
+ MTHCA_CMD_STAT_BAD_PKT = 0x30,
+ /* More outstanding CQEs in CQ than new CQ size: */
+ MTHCA_CMD_STAT_BAD_SIZE = 0x40
+};
+
+enum {
+ MTHCA_TRANS_INVALID = 0,
+ MTHCA_TRANS_RST2INIT,
+ MTHCA_TRANS_INIT2INIT,
+ MTHCA_TRANS_INIT2RTR,
+ MTHCA_TRANS_RTR2RTS,
+ MTHCA_TRANS_RTS2RTS,
+ MTHCA_TRANS_SQERR2RTS,
+ MTHCA_TRANS_ANY2ERR,
+ MTHCA_TRANS_RTS2SQD,
+ MTHCA_TRANS_SQD2SQD,
+ MTHCA_TRANS_SQD2RTS,
+ MTHCA_TRANS_ANY2RST,
+};
+
+enum {
+ DEV_LIM_FLAG_SRQ = 1 << 6
+};
+
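+/* Device limits reported by the QUERY_DEV_LIM firmware command. */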
+struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context,
+		   int cq_num, u8 *status);
+int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context,
+		   int cq_num, u8 *status);
+int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num,
+		    int is_ee, void *qp_context, u32 optmask,
+		    u8 *status);
+int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee,
+		   void *qp_context, u8 *status);
+int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn,
+			  u8 *status);
+int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port,
+		  void *in_mad, void *response_mad, u8 *status);
+int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm,
+		   u8 *status);
+int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm,
+		    u8 *status);
+int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash,
+		    u8 *status);
+
+#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN))
+
+#endif /* MTHCA_CMD_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h	2004-11-21 21:25:54.567105615 -0800
@@ -0,0 +1,51 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $
+ */
+
+#ifndef MTHCA_CONFIG_REG_H
+#define MTHCA_CONFIG_REG_H
+
+#include <asm/page.h>
+
+#define MTHCA_HCR_BASE		0x80680
+#define MTHCA_HCR_SIZE		0x0001c
+#define MTHCA_ECR_BASE		0x80700
+#define MTHCA_ECR_SIZE		0x00008
+#define MTHCA_ECR_CLR_BASE	0x80708
+#define MTHCA_ECR_CLR_SIZE	0x00008
+#define MTHCA_ECR_OFFSET	(MTHCA_ECR_BASE     - MTHCA_HCR_BASE)
+#define MTHCA_ECR_CLR_OFFSET	(MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE)
+#define MTHCA_CLR_INT_BASE	0xf00d8
+#define MTHCA_CLR_INT_SIZE	0x00008
+
+#define MTHCA_MAP_HCR_SIZE	(MTHCA_ECR_CLR_BASE +	\
+				 MTHCA_ECR_CLR_SIZE -	\
+				 MTHCA_HCR_BASE)
+
+#endif /* MTHCA_CONFIG_REG_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
Index: linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c	2004-11-21 21:25:54.594101610 -0800
@@ -0,0 +1,821 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $
+ */
+
+#include <linux/init.h>
+
+#include <ib_pack.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE
+};
+
+enum {
+	MTHCA_CQ_ENTRY_SIZE = 0x20
+};
+
+struct mthca_cq_context {
+	u32 flags;
+	u64 start;
+	u32 logsize_usrpage;
+	u32 error_eqn;
+	u32 comp_eqn;
+	u32 pd;
+	u32 lkey;
+	u32 last_notified_index;
+	u32 solicit_producer_index;
+	u32 consumer_index;
+	u32 producer_index;
+	u32 cqn;
+	u32 reserved[3];
+} __attribute__((packed));
+
+#define MTHCA_CQ_STATUS_OK          ( 0 << 28)
+#define MTHCA_CQ_STATUS_OVERFLOW    ( 9 << 28)
+#define MTHCA_CQ_STATUS_WRITE_FAIL  (10 << 28)
+#define MTHCA_CQ_FLAG_TR            ( 1 << 18)
+#define MTHCA_CQ_FLAG_OI            ( 1 << 17)
+#define MTHCA_CQ_STATE_DISARMED     ( 0 << 8)
+#define MTHCA_CQ_STATE_ARMED        ( 1 << 8)
+#define MTHCA_CQ_STATE_ARMED_SOL    ( 4 << 8)
+#define MTHCA_EQ_STATE_FIRED        (10 << 8)
+
+enum {
+	MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe
+};
+
+enum {
+	SYNDROME_LOCAL_LENGTH_ERR        = 0x01,
+	SYNDROME_LOCAL_QP_OP_ERR         = 0x02,
+	SYNDROME_LOCAL_EEC_OP_ERR        = 0x03,
+	SYNDROME_LOCAL_PROT_ERR          = 0x04,
+	SYNDROME_WR_FLUSH_ERR            = 0x05,
+	SYNDROME_MW_BIND_ERR             = 0x06,
+	SYNDROME_BAD_RESP_ERR            = 0x10,
+	SYNDROME_LOCAL_ACCESS_ERR        = 0x11,
+	SYNDROME_REMOTE_INVAL_REQ_ERR    = 0x12,
+	SYNDROME_REMOTE_ACCESS_ERR       = 0x13,
+	SYNDROME_REMOTE_OP_ERR           = 0x14,
+	SYNDROME_RETRY_EXC_ERR           = 0x15,
+	SYNDROME_RNR_RETRY_EXC_ERR       = 0x16,
+	SYNDROME_LOCAL_RDD_VIOL_ERR      = 0x20,
+	SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21,
+	SYNDROME_REMOTE_ABORTED_ERR      = 0x22,
+	SYNDROME_INVAL_EECN_ERR          = 0x23,
+	SYNDROME_INVAL_EEC_STATE_ERR     = 0x24
+};
+
+struct mthca_cqe {
+	u32 my_qpn;
+	u32 my_ee;
+	u32 rqpn;
+	u16 sl_g_mlpath;
+	u16 rlid;
+	u32 imm_etype_pkey_eec;
+	u32 byte_cnt;
+	u32 wqe;
+	u8  opcode;
+	u8  is_send;
+	u8  reserved;
+	u8  owner;
+} __attribute__((packed));
+
+struct mthca_err_cqe {
+	u32 my_qpn;
+	u32 reserved1[3];
+	u8  syndrome;
+	u8  reserved2;
+	u16 db_cnt;
+	u32 reserved3;
+	u32 wqe;
+	u8  opcode;
+	u8  reserved4[2];
+	u8  owner;
+} __attribute__((packed));
+
+#define MTHCA_CQ_ENTRY_OWNER_SW      (0 << 7)
+#define MTHCA_CQ_ENTRY_OWNER_HW      (1 << 7)
+
+#define MTHCA_CQ_DB_INC_CI       (1 << 24)
+#define MTHCA_CQ_DB_REQ_NOT      (2 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_SOL  (3 << 24)
+#define MTHCA_CQ_DB_SET_CI       (4 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24)
+
+static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry)
+{
+	if (cq->is_direct)
+		return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE);
+	else
+		return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf
+			+ (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE;
+}
+
+static inline int cqe_sw(struct mthca_cq *cq, int i)
+{
+	return !(MTHCA_CQ_ENTRY_OWNER_HW &
+ get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-21 21:25:54.619097902 -0800 @@ -0,0 +1,386 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + 
u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-21 21:25:54.644094195 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. 
+ */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-21 21:25:54.670090339 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-21 21:25:54.696086483 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = pci_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + pci_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + PCI_DMA_TODEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-21 21:25:54.722082627 -0800 @@ -0,0 +1,889 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM 
returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status
0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar2: + pci_release_region(pdev, 2); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
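/* Rough probe order from here: enable the PCI device, sanity-check the BARs and request regions, set the DMA masks, reset the HCA, map its registers, bring up the firmware and HCA tables, then register with the IB midlayer and create the MAD agents. */ +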
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
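+ * (mthca_reset() is expected to save and restore PCI config space around the SW reset, so the PCI-X/PCI Express tuning and MSI setup that follow see a sane configuration.)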
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-21 21:25:54.747078919 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. 
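+ * For example, with hash value 0x12 and one AMGM entry chained behind that slot, looking up the chained GID returns *index = the AMGM slot and *prev = 0x12.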
+ * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * If GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. + */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + 
mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) { + kfree(mailbox); + return -EINTR; + } + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-21 21:25:54.772075211 -0800 @@ -0,0 +1,389 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = 
cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + 
mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-21 21:25:54.797071503 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-11-21 21:25:54.822067796 -0800 @@ -0,0 +1,222 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
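+ * For example, power-of-two sizes 32, 8, 8 and 4 packed in sorted order land at offsets 0, 32, 40 and 44; each start stays aligned to its own size with no padding in between.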
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
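+ * (A PD number is only a tag that the HCA checks against MPT and QP entries, so a fixed software pool of 1 << 15 is plenty.)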
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-11-21 21:25:54.847064088 -0800 @@ -0,0 +1,58 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-11-21 21:25:54.873060232 -0800 @@ -0,0 +1,629 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
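+	/*
+	 * Note: this hand-rolled SubnGet SMP setup is repeated in every
+	 * query method in this file.  A sketch of a helper that could
+	 * factor it out (init_smp() is hypothetical, not part of this
+	 * patch):
+	 *
+	 *	static void init_smp(struct ib_mad *mad, u16 attr_id, u32 attr_mod)
+	 *	{
+	 *		memset(mad, 0, sizeof *mad);
+	 *		mad->mad_hdr.base_version  = 1;
+	 *		mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_SUBN_LID_ROUTED;
+	 *		mad->mad_hdr.class_version = 1;
+	 *		mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
+	 *		mad->mad_hdr.attr_id       = cpu_to_be16(attr_id);
+	 *		mad->mad_hdr.attr_mod      = cpu_to_be32(attr_mod);
+	 *	}
+	 */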
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) |
+	       MTHCA_MPT_FLAG_LOCAL_READ;
+}
+
+static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct mthca_mr *mr;
+	int err;
+
+	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	err = mthca_mr_alloc_notrans(to_mdev(pd->device),
+				     to_mpd(pd)->pd_num,
+				     convert_access(acc), mr);
+
+	if (err) {
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+}
+
+static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd,
+				       struct ib_phys_buf *buffer_list,
+				       int num_phys_buf,
+				       int acc,
+				       u64 *iova_start)
+{
+	struct mthca_mr *mr;
+	u64 *page_list;
+	u64 total_size;
+	u64 mask;
+	int shift;
+	int npages;
+	int err;
+	int i, j, n;
+
+	/* First check that we have enough alignment */
+	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK))
+		return ERR_PTR(-EINVAL);
+
+	if (num_phys_buf > 1 &&
+	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK))
+		return ERR_PTR(-EINVAL);
+
+	mask = 0;
+	total_size = 0;
+	for (i = 0; i < num_phys_buf; ++i) {
+		if (buffer_list[i].addr & ~PAGE_MASK)
+			return ERR_PTR(-EINVAL);
+		if (i != 0 && i != num_phys_buf - 1 &&
+		    (buffer_list[i].size & ~PAGE_MASK))
+			return ERR_PTR(-EINVAL);
+
+		total_size += buffer_list[i].size;
+		if (i > 0)
+			mask |= buffer_list[i].addr;
+	}
+
+	/* Find largest page shift we can use to cover buffers */
+	for (shift = PAGE_SHIFT; shift < 31; ++shift)
+		if (num_phys_buf > 1) {
+			if ((1ULL << shift) & mask)
+				break;
+		} else {
+			if (1ULL << shift >=
+			    buffer_list[0].size +
+			    (buffer_list[0].addr & ((1ULL << shift) - 1)))
+				break;
+		}
+
+	buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1);
+	buffer_list[0].addr &= ~0ull << shift;
+
+	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	npages = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift;
+
+	if (!npages)
+		return &mr->ibmr;
+
+	page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL);
+	if (!page_list) {
+		kfree(mr);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	n = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		for (j = 0;
+		     j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift;
+		     ++j)
+			page_list[n++] = buffer_list[i].addr + ((u64) j << shift);
+
+	mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) "
+		  "in PD %x; shift %d, npages %d.\n",
+		  (unsigned long long) buffer_list[0].addr,
+		  (unsigned long long) *iova_start,
+		  to_mpd(pd)->pd_num,
+		  shift, npages);
+
+	err = mthca_mr_alloc_phys(to_mdev(pd->device),
+				  to_mpd(pd)->pd_num,
+				  page_list, shift, npages,
+				  *iova_start, total_size,
+				  convert_access(acc), mr);
+
+	/* page_list is only needed to build the MPT; free it on
+	   both the success and the error path. */
+	kfree(page_list);
+
+	if (err) {
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+}
+
+static int mthca_dereg_mr(struct ib_mr *mr)
+{
+	mthca_free_mr(to_mdev(mr->device), to_mmr(mr));
+	kfree(mr);
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
+	return sprintf(buf, "%x\n", dev->rev_id);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
+	return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32),
+		       (int) (dev->fw_ver >> 16) & 0xffff,
+		       (int) dev->fw_ver & 0xffff);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
+	switch (dev->hca_type) {
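+	/*
+	 * Note: the strings below are the Mellanox part numbers --
+	 * Tavor is the PCI-X MT23108, Arbel the PCI Express MT25208,
+	 * which can also run in a Tavor compatibility mode.
+	 */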
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = dev->pdev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-11-21 21:25:54.898056524 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ *
+ * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $
+ */
+
+#ifndef MTHCA_PROVIDER_H
+#define MTHCA_PROVIDER_H
+
+#include
+#include
+
+#define MTHCA_MPT_FLAG_ATOMIC        (1 << 14)
+#define MTHCA_MPT_FLAG_REMOTE_WRITE  (1 << 13)
+#define MTHCA_MPT_FLAG_REMOTE_READ   (1 << 12)
+#define MTHCA_MPT_FLAG_LOCAL_WRITE   (1 << 11)
+#define MTHCA_MPT_FLAG_LOCAL_READ    (1 << 10)
+
+struct mthca_buf_list {
+	void *buf;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+struct mthca_mr {
+	struct ib_mr ibmr;
+	int order;
+	u32 first_seg;
+};
+
+struct mthca_pd {
+	struct ib_pd ibpd;
+	u32 pd_num;
+	atomic_t sqp_count;
+	struct mthca_mr ntmr;
+};
+
+struct mthca_eq {
+	struct mthca_dev *dev;
+	int eqn;
+	u32 ecr_mask;
+	u16 msi_x_vector;
+	u16 msi_x_entry;
+	int have_irq;
+	int nent;
+	int cons_index;
+	struct mthca_buf_list *page_list;
+	struct mthca_mr mr;
+};
+
+struct mthca_av;
+
+struct mthca_ah {
+	struct ib_ah ibah;
+	int on_hca;
+	u32 key;
+	struct mthca_av *av;
+	dma_addr_t avdma;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table.  Each
+ * struct mthca_cq/qp also has its own lock.  An individual qp lock
+ * may be taken inside of an individual cq lock.  Both cqs attached to
+ * a qp may be locked, with the send cq locked first.  No other
+ * nesting should be done.
+ *
+ * Each struct mthca_cq/qp also has an atomic_t ref count.  The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference is also good for access through the consumer API, so
+ * modifying the CQ/QP etc. doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed.  Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
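+ *
+ * As a concrete illustration, the completion-event steps above are
+ * exactly what mthca_qp_event() in the mthca_qp.c hunk below does:
+ *
+ *	spin_lock(&dev->qp_table.lock);
+ *	qp = mthca_array_get(&dev->qp_table.qp,
+ *			     qpn & (dev->limits.num_qps - 1));
+ *	if (qp)
+ *		atomic_inc(&qp->refcount);
+ *	spin_unlock(&dev->qp_table.lock);
+ *
+ *	... look at the struct, deliver the event ...
+ *
+ *	if (atomic_dec_and_test(&qp->refcount))
+ *		wake_up(&qp->wait);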
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-11-21 21:25:54.927052223 -0800 @@ -0,0 +1,1485 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_qp.c 1270 2004-11-18 21:47:31Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + 
[UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; 
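+	/*
+	 * Note: the hardware keeps no pkey_index/Q_Key/PSN context for
+	 * the special (MLX transport) QPs, so the values cached here
+	 * are consumed later by build_mlx_header() when each send
+	 * WQE's UD header is built in software.
+	 */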
+ if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + 
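+	/*
+	 * Reading of the constants above: bits [31:29] of mtu_msgmax
+	 * hold the path MTU and bits [28:24] log2 of the maximum
+	 * message size, i.e. 2^11 = 2048 bytes for UD/MLX and 2^31
+	 * for the other transports.
+	 */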
qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, 
"modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + 
+ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + 
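+	/*
+	 * Teardown follows the destroy recipe in the locking comment
+	 * in mthca_provider.h: unhook the QP from qp_table, drop the
+	 * table's reference, then sleep until all completion-event
+	 * users have dropped theirs before freeing anything.
+	 */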
spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+			 cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0)   |
+			cpu_to_be32(1);
+		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
+		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+			((struct mthca_next_seg *) wqe)->imm = wr->imm_data;
+
+		wqe += sizeof (struct mthca_next_seg);
+		size = sizeof (struct mthca_next_seg) / 16;
+
+		if (qp->transport == UD) {
+			((struct mthca_ud_seg *) wqe)->lkey =
+				cpu_to_be32(to_mah(wr->wr.ud.ah)->key);
+			((struct mthca_ud_seg *) wqe)->av_addr =
+				cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma);
+			((struct mthca_ud_seg *) wqe)->dqpn =
+				cpu_to_be32(wr->wr.ud.remote_qpn);
+			((struct mthca_ud_seg *) wqe)->qkey =
+				cpu_to_be32(wr->wr.ud.remote_qkey);
+
+			wqe += sizeof (struct mthca_ud_seg);
+			size += sizeof (struct mthca_ud_seg) / 16;
+		} else if (qp->transport == MLX) {
+			err = build_mlx_header(dev, to_msqp(qp), ind, wr,
+					       wqe - sizeof (struct mthca_next_seg),
+					       wqe);
+			if (err) {
+				*bad_wr = wr;
+				goto out;
+			}
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		if (wr->num_sge > qp->sq.max_gs) {
+			mthca_err(dev, "too many gathers\n");
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		for (i = 0; i < wr->num_sge; ++i) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32(wr->sg_list[i].length);
+			((struct mthca_data_seg *) wqe)->lkey =
+				cpu_to_be32(wr->sg_list[i].lkey);
+			((struct mthca_data_seg *) wqe)->addr =
+				cpu_to_be64(wr->sg_list[i].addr);
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		/* Add one more inline data segment for ICRC */
+		if (qp->transport == MLX) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32((1 << 31) | 4);
+			((u32 *) wqe)[1] = 0;
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		qp->wrid[ind + qp->rq.max] = wr->wr_id;
+
+		if (wr->opcode >= ARRAY_SIZE(opcode)) {
+			mthca_err(dev, "opcode invalid\n");
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		if (prev_wqe) {
+			((struct mthca_next_seg *) prev_wqe)->nda_op =
+				cpu_to_be32(((ind << qp->sq.wqe_shift) +
+					     qp->send_wqe_offset) |
+					    opcode[wr->opcode]);
+			smp_wmb();
+			((struct mthca_next_seg *) prev_wqe)->ee_nds =
+				cpu_to_be32((size0 ?
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ Index: linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-11-21 21:25:54.952048515 -0800 @@ -0,0 +1,228 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * <http://openib.org/license.html>. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $ + */ + +#include <linux/config.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/pci.h> +#include <linux/delay.h> + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID.
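+ * (For Tavor the HCA appears at PCI device ID 0x5a44 and its bridge at 0x5a46.)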
*/ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
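+ * PCI_COMMAND (offset 0x4) is likewise skipped inside both restore loops and rewritten last of all below, so the device is re-enabled only after the rest of its configuration is back in place.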
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Mon Nov 22 07:13:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:54 -0800 Subject: [openib-general] [PATCH][RFC/v1][6/12] IPoIB IPv4 multicast In-Reply-To: <20041122713.cSeT4UFKGqJDdZ8T@topspin.com> Message-ID: <20041122713.Md0y3UqVvYcRT3Zf@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add <linux/if_infiniband.h> so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier Index: linux-bk/include/linux/if_infiniband.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-11-21 21:25:56.078881371 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * <http://openib.org/license.html>. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ Index: linux-bk/include/net/ip.h =================================================================== --- linux-bk.orig/include/net/ip.h 2004-11-21 21:07:12.110687532 -0800 +++ linux-bk/include/net/ip.h 2004-11-21 21:25:56.078881371 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver.
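+ * The 20-byte address is one reserved byte plus the 3-byte multicast QPN (0xffffff), followed by the 16-byte multicast GID carrying the scope bits, the IPv4 signature and the low 28 bits of the group address.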
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif Index: linux-bk/net/ipv4/arp.c =================================================================== --- linux-bk.orig/net/ipv4/arp.c 2004-11-21 21:07:24.904787535 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-21 21:25:56.079881223 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Nov 22 07:13:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:13:59 -0800 Subject: [openib-general] [PATCH][RFC/v1][7/12] IPoIB IPv6 support In-Reply-To: <20041122713.Md0y3UqVvYcRT3Zf@topspin.com> Message-ID: <20041122713.FnSlYodJYum7s82D@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier Index: linux-bk/include/net/if_inet6.h =================================================================== --- linux-bk.orig/include/net/if_inet6.h 2004-11-21 21:07:35.126269616 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-21 21:25:56.386835692 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif Index: linux-bk/net/ipv6/addrconf.c =================================================================== --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-21 21:07:29.222146392 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-21 21:25:56.387835544 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1098,6 +1099,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
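* (Each type admitted by the check above has a matching case in the EUI-64 generation switch earlier in this file.)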
*/ return; Index: linux-bk/net/ipv6/ndisc.c =================================================================== --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-21 21:07:06.642499599 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-21 21:25:56.388835395 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Nov 22 07:14:11 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:11 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support Message-ID: <20041122714.9zlcKGKvXlpga8EP@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support Date: Mon, 22 Nov 2004 07:14:11 -0800 Size: 23296 URL: From roland at topspin.com Mon Nov 22 07:14:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:04 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver Message-ID: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: Roland Dreier Subject: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver Date: Mon, 22 Nov 2004 07:14:04 -0800 Size: 101204 URL: From roland at topspin.com Mon Nov 22 07:14:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:17 -0800 Subject: [openib-general] [PATCH][RFC/v1][10/12] Document InfiniBand ioctl use In-Reply-To: <20041122714.9zlcKGKvXlpga8EP@topspin.com> Message-ID: <20041122714.taTI3zcdWo5JfuMd@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/ioctl-number.txt =================================================================== --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-21 21:07:31.047875266 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-21 21:25:57.971600622 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Mon Nov 22 07:14:22 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:22 -0800 Subject: [openib-general] [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122714.taTI3zcdWo5JfuMd@topspin.com> Message-ID: <20041122714.AyIOvRY195EGFTaO@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier Index: linux-bk/Documentation/infiniband/ipoib.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-21 21:25:58.205565918 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. 
It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<intf name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance. + +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html Index: linux-bk/Documentation/infiniband/sysfs.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-21 21:25:58.231562062 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port.
For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" Index: linux-bk/Documentation/infiniband/user_mad.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-21 21:25:58.258558058 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). 
The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Mon Nov 22 07:14:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:14:27 -0800 Subject: [openib-general] [PATCH][RFC/v1][12/12] InfiniBand MAINTAINERS entry In-Reply-To: <20041122714.AyIOvRY195EGFTaO@topspin.com> Message-ID: <20041122714.y3rav5uMdVVNMNlz@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier Index: linux-bk/MAINTAINERS =================================================================== --- linux-bk.orig/MAINTAINERS 2004-11-21 21:07:06.694491878 -0800 +++ linux-bk/MAINTAINERS 2004-11-21 21:25:58.537516680 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From tziporet at mellanox.co.il Mon Nov 22 07:22:26 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 22 Nov 2004 17:22:26 +0200 Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEAAD@mtlex01.yok.mtl.com> Congratulations for this important step toward the inclusion of Infiniband drivers in Linux kernel. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Monday, November 22, 2004 5:13 PM To: linux-kernel at vger.kernel.org Cc: openib-general at openib.org Subject: [openib-general] [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review I'm very happy to be able to post an initial version of InfiniBand patches for review. Although this code should be far closer to kernel coding standards than previous open source InfiniBand drivers, this initial posting should be treated as a request for comments and not a request for inclusion; our ultimate goal is to have these drivers included in the mainline kernel, but we expect that fixes and improvements will need to be made before the code is completely acceptable. These patches add a minimal but complete level of InfiniBand support, including an IB midlayer, a low-level driver for Mellanox HCAs, an IP-over-InfiniBand driver, and a mechanism for MADs (management datagrams) to be passed to and from userspace. This means that these patches are all that is required for the kernel to bring up and use an IP-over-InfiniBand link. (The OpenSM subnet manager has not been ported to this kernel API yet, although this work is underway. 
This means that at the moment, a kernel with these patches cannot be used to bring up a fabric; however, the kernel side is complete) The code has not been through extreme stress testing yet, but it has been used successfully on i386, x86_64, ppc64, ia64 and sparc64 systems, including mixed 32/64 systems. Feedback on both details of the code as well as the high-level organization of the code will be very much appreciated. For example, the current set of patches puts include files in driver/infiniband/include; would it be preferred to put include files in include/linux/infiniband/, directly in include/linux, or perhaps in include/infiniband? We would also like to explore the best avenue for having these patches merged. It may be desirable for the patches to spend some time in -mm before moving into Linus's kernel; on the other hand, the patches make only very minimal and safe changes outside of drivers/infiniband, so it is quite reasonable to merge them directly into the mainline kernel. Although 2.6.10 is now closed, 2.6.11 will probably be open by the time the review process is complete. We look forward to the community's comments and criticisms! Thanks, Roland Dreier OpenIB Alliance www.openib.org _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From hch at infradead.org Mon Nov 22 07:31:44 2004 From: hch at infradead.org (Christoph Hellwig) Date: Mon, 22 Nov 2004 15:31:44 +0000 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122714.AyIOvRY195EGFTaO@topspin.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> Message-ID: <20041122153144.GA4821@infradead.org> > + When the IPoIB driver is loaded, it creates one interface for each > + port using the P_Key at index 0. To create an interface with a > + different P_Key, write the desired P_Key into the main interface's > + /sys/class/net//create_child file. For example: > + > + echo 0x8001 > /sys/class/net/ib0/create_child > + > + This will create an interface named ib0.8001 with P_Key 0x8001. To > + remove a subinterface, use the "delete_child" file: > + > + echo 0x8001 > /sys/class/net/ib0/delete_child > + > + The P_Key for any interface is given by the "pkey" file, and the > + main interface for a subinterface is in "parent." Any reason this doesn't use an interface similar to the normal vlan code? And what is a P_Key? From roland at topspin.com Mon Nov 22 07:41:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 07:41:47 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122153144.GA4821@infradead.org> (Christoph Hellwig's message of "Mon, 22 Nov 2004 15:31:44 +0000") References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122153144.GA4821@infradead.org> Message-ID: <52k6sdevr8.fsf@topspin.com> Christoph> Any reason this doesn't use an interface similar to the Christoph> normal vlan code? The normal vlan code uses an ioctl(). I thought a simple sysfs interface would be more palatable than a new socket ioctl. Christoph> And what is a P_Key? It is a 16-bit identifier carried by IB packets that says which partition the packet is in. 
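(The high-order bit of a P_Key marks full versus limited membership, so 0x8001 and 0x0001 name the same partition with different membership types.)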
End ports have P_Key tables that list which partitions they are members of (a port can be a member of one or more partitions, and can only receive packets from that partition). - Roland From roland at topspin.com Mon Nov 22 10:31:10 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 10:31:10 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support In-Reply-To: <20041122713.TMt4584EVSreQOO2@topspin.com> (Roland Dreier's message of "Mon, 22 Nov 2004 07:13:29 -0800") References: <20041122713.TMt4584EVSreQOO2@topspin.com> Message-ID: <528y8tenwx.fsf@topspin.com> Seems like spamassassin is still overzealous. I'm confused by the tests that applied to this message: > 2.4 RATWARE_HASH_2_V2 Bulk email fingerprint (hash 2 v2) found > 1.2 RATWARE_HASH_2 Bulk email fingerprint (hash 2) found I did some digging on this. This SA rules seem pretty bogus -- they just look at if the X-Mailer line has at least 14 (resp. 16) characters from [A-Za-z0-9_]. My patch script sets X-Mailer to "roland_patchbomb", which is 16 characters long, so all my patch mail gets a log factor of 3.6 right off the bat. I'll work around this by changing my X-Mailer... > 1.8 DOMAIN_BODY BODY: Domain registration spam body This test thinks a kernel patch looks like domain registration spam ?? > 1.1 REMOVE_REMOVAL_NEAR List removal information And a log factor of 1.1 for list removal information, which is added to every message by mailman... - R. From iod00d at hp.com Mon Nov 22 10:47:56 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 22 Nov 2004 10:47:56 -0800 Subject: [openib-general] *****SPAM***** [PATCH][RFC/v1][1/12] Add core InfiniBand support In-Reply-To: <528y8tenwx.fsf@topspin.com> References: <20041122713.TMt4584EVSreQOO2@topspin.com> <528y8tenwx.fsf@topspin.com> Message-ID: <20041122184756.GG4100@esmail.cup.hp.com> On Mon, Nov 22, 2004 at 10:31:10AM -0800, Roland Dreier wrote: > Seems like spamassassin is still overzealous. I'm confused by the > tests that applied to this message: For my personal use, I added overrides for how certain tests score to ~/.spamassassin/user_prefs. I'm sure something similar exists for mailing list use. e.g.: ... score MIME_HTML_ONLY 5.00 score MIME_HTML_NO_CHARSET 2.00 score MICROSOFT_EXECUTABLE 3.50 score HTML_FONT_INVISIBLE 5.00 ... The four tests that you pointed out could just have their "hit" score lowered so they add less (zero?) to the total score. Hacking around, it looks like adding similar lines to /etc/spamassissin/local.cf would change the "default" test scores system wide. (at least on my debian machine) grant From sam at ravnborg.org Mon Nov 22 11:33:50 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Mon, 22 Nov 2004 20:33:50 +0100 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> Message-ID: <20041122193350.GB8150@mars.ravnborg.org> Nitpicking. Sam > --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-21 21:25:53.101323036 -0800 > +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-21 21:25:53.879207651 -0800 > @@ -2,7 +2,8 @@ > > obj-$(CONFIG_INFINIBAND) += \ > ib_core.o \ > - ib_mad.o > + ib_mad.o \ > + ib_sa.o It's more readable to keep .o files on one line. 
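For example, "obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o" can be read at a glance.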
> > ib_core-objs := \ > packer.o \ For new stuff please use ib_core-y := > @@ -17,3 +18,5 @@ > mad.o \ > smi.o \ > agent.o > + > +ib_sa-objs := sa_query.o ib_sa-y := please. > +#include > + > +#include <ib_pack.h> > +#include <ib_sa.h> If they are in same dir as .c file use: #include "ib_pack.h" #include "ib_sa.h" > Index: linux-bk/drivers/infiniband/include/ib_sa.h .h files for a subsystem like this ought to be placed in include/infiniband if they will be used by files in other directories than drivers/infiniband From sam at ravnborg.org Mon Nov 22 11:40:03 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Mon, 22 Nov 2004 20:40:03 +0100 Subject: [openib-general] Re: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> References: <20041122713.FnSlYodJYum7s82D@topspin.com> <20041122714.nKCPmH9LMhT0X7WE@topspin.com> Message-ID: <20041122194003.GC8150@mars.ravnborg.org> More nitpicking.. Sam > +++ linux-bk/drivers/infiniband/Makefile 2004-11-21 21:25:56.794775182 -0800 > @@ -1,2 +1,3 @@ > obj-$(CONFIG_INFINIBAND) += core/ No reason to use $(CONFIG_INFINIBAND) here - it's already done in drivers/Makefile > +EXTRA_CFLAGS += -Idrivers/infiniband/include This will get killed if you move the include files... + > +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o > + > +ib_ipoib-y := ipoib_main.o \ > + ipoib_ib.o \ > + ipoib_multicast.o \ > + ipoib_verbs.o \ > + ipoib_vlan.o One or two lines. > +#include > + > +#include "ipoib_proto.h" Should be included as the last file - since it's the most local one. > + > +#include > +#include > +#include > From roland at topspin.com Mon Nov 22 13:28:56 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 13:28:56 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122193350.GB8150@mars.ravnborg.org> (Sam Ravnborg's message of "Mon, 22 Nov 2004 20:33:50 +0100") References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122193350.GB8150@mars.ravnborg.org> Message-ID: <521xeld147.fsf@topspin.com> Sam> Nitpicking. Great, thanks for the help :) I'll fix these up before the next version of the patches is posted. Sam> It's more readable to keep .o files on one line. OK, I will reformat our Makefiles. (I used the old style because it's easier to add/remove source files, but I think you're right that it's better to optimize for readability rather than the rare event of adding/removing sources) Sam> For new stuff please use ib_core-y := OK, no problem (until a few days ago I didn't even know -y was equivalent to -objs, let alone preferred). Sam> .h files for a subsystem like this ought to be placed in Sam> include/infiniband if they will be used by files in other Sam> directories than drivers/infiniband Right now all the code is in drivers/infiniband. However Christoph suggested moving the .h files to include/infiniband as well. I have no problem moving the includes (and as you point out this eliminates having to add a -I to our CFLAGS), but on the other hand do we want to add a new toplevel include directory for what is still admittedly a minor subsystem?
Thanks, Roland From greg at kroah.com Mon Nov 22 14:25:07 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:25:07 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122713.g6bh6aqdXIN4RJYR@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> Message-ID: <20041122222507.GB15634@kroah.com> On Mon, Nov 22, 2004 at 07:13:48AM -0800, Roland Dreier wrote: > > Index: linux-bk/drivers/infiniband/core/Makefile > =================================================================== Please hack your submit script to not add these headers, when importing to bk they end up showing up in the change log comments :( > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-11-21 21:25:53.928200384 -0800 > @@ -0,0 +1,815 @@ > +/* > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available at > + * , or the OpenIB.org BSD > + * license, available in the LICENSE.TXT file accompanying this > + * software. These details are also available at > + * . > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Copyright (c) 2004 Topspin Communications. All rights reserved. No email address of who to bug with issues? > + * > + * $Id$ Not needed :) > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > + > +MODULE_AUTHOR("Roland Dreier"); > +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); > +MODULE_LICENSE("Dual BSD/GPL"); > + > +struct ib_sa_hdr { > + u64 sm_key; > + u16 attr_offset; > + u16 reserved; > + ib_sa_comp_mask comp_mask; > +} __attribute__ ((packed)); Why is this packed? > +struct ib_sa_mad { > + struct ib_mad_hdr mad_hdr; > + struct ib_rmpp_hdr rmpp_hdr; > + struct ib_sa_hdr sa_hdr; > + u8 data[200]; > +} __attribute__ ((packed)); Same here? 
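For reference, the change is mechanical. A minimal sketch of the pattern (hypothetical buf and size variables; mthca reaches the generic device through the PCI device it already keeps in mdev->pdev):

    /* Before: PCI-specific mapping, tied to struct pci_dev */
    dma_addr_t mapping = pci_map_single(mdev->pdev, buf, size,
                                        PCI_DMA_TODEVICE);
    pci_unmap_single(mdev->pdev, mapping, size, PCI_DMA_TODEVICE);

    /* After: generic DMA API, works with any bus's struct device */
    dma_addr_t mapping = dma_map_single(&mdev->pdev->dev, buf, size,
                                        DMA_TO_DEVICE);
    dma_unmap_single(&mdev->pdev->dev, mapping, size, DMA_TO_DEVICE);

The pci_unmap_addr()/pci_unmap_addr_set() bookkeeping is the one piece left on the PCI helpers, since the generic API has no equivalent yet.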
> + > +struct ib_sa_sm_ah { > + struct ib_ah *ah; > + struct kref ref; > +}; > + > +struct ib_sa_port { > + struct ib_mad_agent *agent; > + struct ib_mr *mr; > + struct ib_sa_sm_ah *sm_ah; > + struct work_struct update_task; > + spinlock_t ah_lock; > + u8 port_num; > +}; > + > +struct ib_sa_device { > + int start_port, end_port; > + struct ib_event_handler event_handler; > + struct ib_sa_port port[0]; > +}; > + > +struct ib_sa_query { > + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); > + void (*release)(struct ib_sa_query *); > + struct ib_sa_port *port; > + struct ib_sa_mad *mad; > + struct ib_sa_sm_ah *sm_ah; > + DECLARE_PCI_UNMAP_ADDR(mapping) > + int id; > +}; > + > +struct ib_sa_path_query { > + void (*callback)(int, struct ib_sa_path_rec *, void *); > + void *context; > + struct ib_sa_query sa_query; > +}; > + > +struct ib_sa_mcmember_query { > + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); > + void *context; > + struct ib_sa_query sa_query; > +}; > + > +static void ib_sa_add_one(struct ib_device *device); > +static void ib_sa_remove_one(struct ib_device *device); > + > +static struct ib_client sa_client = { > + .name = "sa", > + .add = ib_sa_add_one, > + .remove = ib_sa_remove_one > +}; > + > +static spinlock_t idr_lock; > +DEFINE_IDR(query_idr); Should this be global or static? > + > +static spinlock_t tid_lock; > +static u32 tid; > + > +enum { > + IB_SA_ATTR_CLASS_PORTINFO = 0x01, > + IB_SA_ATTR_NOTICE = 0x02, > + IB_SA_ATTR_INFORM_INFO = 0x03, > + IB_SA_ATTR_NODE_REC = 0x11, > + IB_SA_ATTR_PORT_INFO_REC = 0x12, > + IB_SA_ATTR_SL2VL_REC = 0x13, > + IB_SA_ATTR_SWITCH_REC = 0x14, > + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, > + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, > + IB_SA_ATTR_MCAST_FDB_REC = 0x17, > + IB_SA_ATTR_SM_INFO_REC = 0x18, > + IB_SA_ATTR_LINK_REC = 0x20, > + IB_SA_ATTR_GUID_INFO_REC = 0x30, > + IB_SA_ATTR_SERVICE_REC = 0x31, > + IB_SA_ATTR_PARTITION_REC = 0x33, > + IB_SA_ATTR_RANGE_REC = 0x34, > + IB_SA_ATTR_PATH_REC = 0x35, > + IB_SA_ATTR_VL_ARB_REC = 0x36, > + IB_SA_ATTR_MC_GROUP_REC = 0x37, > + IB_SA_ATTR_MC_MEMBER_REC = 0x38, > + IB_SA_ATTR_TRACE_REC = 0x39, > + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, > + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b > +}; Oops, tabs vs. spaces. Care to use the __bitwise field here so that you can have sparse check to see that you are actually using the proper enum values in all places? See the kobject_action code for an example of this. > + > +#define PATH_REC_FIELD(field) \ > + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ > + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ > + .field_name = "sa_path_rec:" #field > + > +static const struct ib_field path_rec_table[] = { > + { RESERVED, > + .offset_words = 0, > + .offset_bits = 0, > + .size_bits = 32 }, What is "RESERVED"? I must be missing a previous patch somewhere, I currently don't see all of the series yet. thanks, greg k-h From greg at kroah.com Mon Nov 22 14:13:04 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:13:04 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> Message-ID: <20041122221304.GA15634@kroah.com> On Mon, Nov 22, 2004 at 07:13:24AM -0800, Roland Dreier wrote: > organization of the code will be very much appreciated. 
For example, > the current set of patches puts include files in driver/infiniband/include; > would it be preferred to put include files in include/linux/infiniband/, > directly in include/linux, or perhaps in include/infiniband? Who would be including these files, only drivers in drivers/infiniband? Or from files in other parts of the kernel? If from other parts of the kernel, use include/linux/infiniband. thanks, greg k-h From greg at kroah.com Mon Nov 22 14:34:32 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:34:32 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> References: <20041122713.FnSlYodJYum7s82D@topspin.com> <20041122714.nKCPmH9LMhT0X7WE@topspin.com> Message-ID: <20041122223432.GC15634@kroah.com> On Mon, Nov 22, 2004 at 07:14:04AM -0800, Roland Dreier wrote: > > +#define ipoib_printk(level, priv, format, arg...) \ > + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) > +#define ipoib_warn(priv, format, arg...) \ > + ipoib_printk(KERN_WARNING, priv, format , ## arg) What's wrong with using the dev_printk() and friends instead of your own? And why cast a pointer in a macro, don't you know the type of it anyway? > Index: linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-11-21 21:25:56.924755902 -0800 You're using a separate filesystem to export debug data? I'm all for new virtual filesystems, but why not just use sysfs for this? What are you doing in here that you can't do with another mechanism (netlink, sysfs, sockets, relayfs, etc.)? > +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > +#define DATA_PATH_DEBUG_HELP " and data path tracing if > 1" > +#else > +#define DATA_PATH_DEBUG_HELP "" > +#endif > + > +module_param(debug_level, int, 0644); > +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" DATA_PATH_DEBUG_HELP); Why not just use 2 different debug variables for this? > + > +int mcast_debug_level; Global? thanks, greg k-h From ftillier at infiniconsys.com Mon Nov 22 14:40:45 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 22 Nov 2004 14:40:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA(Subnet Administration) query support In-Reply-To: <20041122222507.GB15634@kroah.com> Message-ID: <000401c4d0e4$4edb43d0$655aa8c0@infiniconsys.com> > From: Greg KH [mailto:greg at kroah.com] > Sent: Monday, November 22, 2004 2:25 PM > > > +struct ib_sa_hdr { > > + u64 sm_key; > > + u16 attr_offset; > > + u16 reserved; > > + ib_sa_comp_mask comp_mask; > > +} __attribute__ ((packed)); > > Why is this packed? > > > +struct ib_sa_mad { > > + struct ib_mad_hdr mad_hdr; > > + struct ib_rmpp_hdr rmpp_hdr; > > + struct ib_sa_hdr sa_hdr; > > + u8 data[200]; > > +} __attribute__ ((packed)); > > Same here? These describe on-the-wire IB structures, and their definition matches the IB spec (Version 1.1, Volume 1) struct ib_mad_hdr matches "Standard MAD Header", Figure 144 struct ib_rmpp_hdr matches "RMPP MAD Header", Figure 168 struct ib_sa_hdr and struct ib_sa_mad match "SA Header", Figure 193 Hope that answers your question - let us know if it doesn't. 
Cheers, - Fab From roland at topspin.com Mon Nov 22 14:50:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 14:50:41 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <20041122221304.GA15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:13:04 -0800") References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> <20041122221304.GA15634@kroah.com> Message-ID: <52wtwdbiri.fsf@topspin.com> Greg> Who would be including these files, only drivers in Greg> drivers/infiniband? Or from files in other parts of the Greg> kernel? In the current patchset all the code is under drivers/infiniband. Greg> If from other parts of the kernel, use include/linux/infiniband. That's one vote for include/linux/infiniband and two votes for include/infiniband so far... - R. From greg at kroah.com Mon Nov 22 14:50:33 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:50:33 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041122714.9zlcKGKvXlpga8EP@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> Message-ID: <20041122225033.GD15634@kroah.com> On Mon, Nov 22, 2004 at 07:14:11AM -0800, Roland Dreier wrote: > Add a driver that provides a character special device for each > InfiniBand port. This device allows userspace to send and receive > MADs via write() and read() (with some control operations implemented > as ioctls). Do you really need these ioctls? For example: > +static int ib_umad_ioctl(struct inode *inode, struct file *filp, > + unsigned int cmd, unsigned long arg) > +{ > + switch (cmd) { > + case IB_USER_MAD_GET_ABI_VERSION: > + return put_user(IB_USER_MAD_ABI_VERSION, > + (u32 __user *) arg) ? -EFAULT : 0; This could be in a sysfs file, right? > + case IB_USER_MAD_REGISTER_AGENT: > + return ib_umad_reg_agent(filp->private_data, arg); > + case IB_USER_MAD_UNREGISTER_AGENT: > + return ib_umad_unreg_agent(filp->private_data, arg); You are letting any user, with any privilege register or unregister an "agent"? And shouldn't you lock your list of agent ids when adding or removing one, or are you relying on the BKL of the ioctl call? If so, please document this. Also, these "agents" seem to be a type of filter, right? Is there no other way to implement this than an ioctl? thanks, greg k-h From greg at kroah.com Mon Nov 22 14:53:35 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 14:53:35 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122714.AyIOvRY195EGFTaO@topspin.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> Message-ID: <20041122225335.GE15634@kroah.com> On Mon, Nov 22, 2004 at 07:14:22AM -0800, Roland Dreier wrote: > +/dev files > + > + To create the appropriate character device files automatically with > + udev, a rule like > + > + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" > + > + can be used. This will create a device node named > + > + /dev/infiniband/mthca0/ports/1/mad > + > + for port 1 of device mthca0, and so on. Why do you propose such a "deep" nesting of directories for umad devices? That's not the LANNANA way. Oh, have you asked for a real major number to be reserved for umad? 
thanks, greg k-h From roland at topspin.com Mon Nov 22 14:58:45 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 14:58:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122225335.GE15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:53:35 -0800") References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> Message-ID: <52sm71bie2.fsf@topspin.com> Greg> Why do you propose such a "deep" nesting of directories for Greg> umad devices? That's not the LANNANA way. No real reason, I'm open to better suggestions. Greg> Oh, have you asked for a real major number to be reserved Greg> for umad? No, I think we're fine with a dynamic major. Is there any reason to want a real major? - Roland From roland at topspin.com Mon Nov 22 15:05:40 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:05:40 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041122225033.GD15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:50:33 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> Message-ID: <52oehpbi2j.fsf@topspin.com> >> Add a driver that provides a character special device for each >> InfiniBand port. This device allows userspace to send and >> receive MADs via write() and read() (with some control >> operations implemented as ioctls). Greg> Do you really need these ioctls? Greg> This could be in a sysfs file, right? The API version definitely can be, good point. Greg> You are letting any user, with any privilege register or Greg> unregister an "agent"? They have to be able to open the device node. We could add a check that they have it open for writing but there's not really much point in opening this device read-only. Greg> And shouldn't you lock your list of agent ids when adding or Greg> removing one, or are you relying on the BKL of the ioctl Greg> call? If so, please document this. Each file has an "agent_mutex" rwsem that protects this... the global list of agents handled by the lower level API is protected by its own locking. Greg> Also, these "agents" seem to be a type of filter, right? Is Greg> there no other way to implement this than an ioctl? ioctl seems to be the least bad way to me. This really feels like a legitimate use of ioctl to me -- we use read/write to handle passing data through our file descriptor, and ioctl for control of the properties of the descriptor. What would you suggest as an ioctl replacement? Thanks, Roland From greg at kroah.com Mon Nov 22 15:05:33 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 15:05:33 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <52sm71bie2.fsf@topspin.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> Message-ID: <20041122230533.GB13083@kroah.com> On Mon, Nov 22, 2004 at 02:58:45PM -0800, Roland Dreier wrote: > Greg> Why do you propose such a "deep" nesting of directories for > Greg> umad devices? That's not the LANNANA way. > > No real reason, I'm open to better suggestions. /dev/umad* /dev/ib/umad* > Greg> Oh, have you asked for a real major number to be reserved > Greg> for umad? 
> > No, I think we're fine with a dynamic major. Is there any reason to > want a real major? People who do not use udev will not like you. thanks, greg k-h From greg at kroah.com Mon Nov 22 15:01:28 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 15:01:28 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <52wtwdbiri.fsf@topspin.com> References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> <20041122221304.GA15634@kroah.com> <52wtwdbiri.fsf@topspin.com> Message-ID: <20041122230128.GA13083@kroah.com> On Mon, Nov 22, 2004 at 02:50:41PM -0800, Roland Dreier wrote: > Greg> Who would be including these files, only drivers in > Greg> drivers/infiniband? Or from files in other parts of the > Greg> kernel? > > In the current patchset all the code is under drivers/infiniband. Then it should just stay in that directory. Well, that's my preference anyway :) thanks, greg k-h From roland at topspin.com Mon Nov 22 15:18:07 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:18:07 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][8/12] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041122223432.GC15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:34:32 -0800") References: <20041122713.FnSlYodJYum7s82D@topspin.com> <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122223432.GC15634@kroah.com> Message-ID: <52k6sdbhhs.fsf@topspin.com> Greg> What's wrong with using the dev_printk() and friends instead Greg> of your own? dev_printk expects a struct device, not a net_device. Greg> And why cast a pointer in a macro, don't you know the type Greg> of it anyway? this lets us pass in the return value of netdev_priv() directly without having to have the cast in the code that uses the macro. Greg> You're using a separate filesystem to export debug data? Greg> I'm all for new virtual filesystems, but why not just use Greg> sysfs for this? What are you doing in here that you can't Greg> do with another mechanism (netlink, sysfs, sockets, relayfs, Greg> etc.)? For each multicast group, we want to export the GID, how long it's been around, whether our join has completed and whether it's send-only. It wouldn't be too bad to create a kobject with all those attributes but getting the info from so many little files is a little bit of a pain, and so is dealing with kobject lifetime rules. It's even worse with netlink since then a new tool is required. (AFAIK relayfs isn't in Linus's kernel). It's nice to be able to tell someone to just mount ipoib_debugfs and send the contents of debugfs/ib0_mcg. The actual filesystem stuff is pretty trivial using everything libfs provides for us now... Greg> Why not just use 2 different debug variables for this? No real reason... I'll fix it up. >> + +int mcast_debug_level; Greg> Global? Good point, I'll move it into ipoib_multicast.c. - R. 
From roland at topspin.com Mon Nov 22 15:21:26 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:21:26 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122230533.GB13083@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 15:05:33 -0800") References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> Message-ID: <52fz31bhc9.fsf@topspin.com> Greg> /dev/umad* /dev/ib/umad* Right now the umad module creates devices with kernel names like umad0, umad1, etc, but it puts ibdev and port files in sysfs so userspace can figure out which IB device and port the file corresponds to. I would really prefer to have this info reflected in the /dev name... Greg> People who do not use udev will not like you. OK, I guess we will apply to LANANA. - R. From johannes at erdfelt.com Mon Nov 22 15:30:47 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Mon, 22 Nov 2004 15:30:47 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122230533.GB13083@kroah.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> Message-ID: <20041122233047.GH27658@sventech.com> On Mon, Nov 22, 2004, Greg KH wrote: > On Mon, Nov 22, 2004 at 02:58:45PM -0800, Roland Dreier wrote: > > Greg> Oh, have you asked for a real major number to be reserved > > Greg> for umad? > > > > No, I think we're fine with a dynamic major. Is there any reason to > > want a real major? > > People who do not use udev will not like you. I don't quite understand this. Given things like udev, wouldn't dynamic majors work just like having a static major number? JE From roland at topspin.com Mon Nov 22 15:34:23 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 15:34:23 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041122222507.GB15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:25:07 -0800") References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> Message-ID: <527jodbgqo.fsf@topspin.com> Greg> Please hack your submit script to not add these headers, Greg> when importing to bk they end up showing up in the change Greg> log comments :( OK, will do. Greg> No email address of who to bug with issues? There's a patch to MAINTAINERS... Greg> Why is this packed? Greg> Same here? Both of these structures unfortunately have 64 bit fields only aligned to 32 bits (and are sent on the wire so we can't fiddle with the layout). So without the "packed" they won't come out right on 64-bit archs. Greg> Should this be global or static? static, fixed. Greg> Oops, tabs vs. spaces. fixed. Greg> Care to use the __bitwise field here so that you can have Greg> sparse check to see that you are actually using the proper Greg> enum values in all places? See the kobject_action code for Greg> an example of this. Sure, that's a good idea. I'll look for other places we can do this too. Greg> What is "RESERVED"? I must be missing a previous patch Greg> somewhere, I currently don't see all of the series yet. 
From jjengla at sandia.gov Mon Nov 22 17:26:04 2004
From: jjengla at sandia.gov (Josh England)
Date: Mon, 22 Nov 2004 17:26:04 -0800
Subject: [openib-general] troubles with IPoIB
Message-ID: <1101173164.18604.53.camel@localhost>

Hi all,

I've got an 85-node x86_64 PCIe cluster I'd like to run (and test)
openIB on. I've built a kernel using the latest patches from SVN,
loaded all the modules, and I see ACTIVE on the ports, but IPoIB does
not seem to want to work. After I 'echo 2 > module/ib_ipoib/debug_level',
I get the following:

Nov 22 16:47:03 n0 kernel: ib0: called: id 14, op 0, status: 0
Nov 22 16:47:03 n0 kernel: ib0: send complete, wrid 14
Nov 22 16:47:03 n0 kernel: ib0: called: id -2147483634, op 128, status: 0
Nov 22 16:47:03 n0 kernel: ib0: received 100 bytes, SLID 0x000d
Nov 22 16:47:03 n0 kernel: ib0: dropping loopback packet
Nov 22 16:47:04 n0 kernel: ib0: sending packet, length=60 address=000001007fd17340 qpn=0xffffff

Is the 'called: id [big-fat-negative-number]' supposed to be there?
Is there a utility (such as vping) to test even basic IB connectivity?
How can I get IPoIB to work?

-----------------------------------------------
Josh England
Sandia National Laboratory, Livermore, CA
Visualization and Scientific Computing
email: jjengla at sandia.gov
phone: (925) 294-2076

-JE

From roland at topspin.com Mon Nov 22 17:34:35 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 17:34:35 -0800
Subject: [openib-general] troubles with IPoIB
In-Reply-To: <1101173164.18604.53.camel@localhost> (Josh England's message of "Mon, 22 Nov 2004 17:26:04 -0800")
References: <1101173164.18604.53.camel@localhost>
Message-ID: <52vfbx9wlw.fsf@topspin.com>

    Josh> Is the 'called: id [big-fat-negative-number]' supposed to be
    Josh> there?

Yes, that's fine (it's just an artifact of the fact that receive work
request IDs get (1<<31) ORed in). I should clean up that debug
message though.

The debug messages show IPoIB apparently sending an ARP packet and
seeing it appear on the broadcast group (looped back locally).

    Josh> Is there a utility (such as vping) to test even
    Josh> basic IB connectivity?

Unfortunately not yet.

    Josh> How can I get IPoIB to work?

What subnet manager are you using? If you mount ipoib_debugfs
somewhere (say /ipoib_debugfs), what do you get from

    cat /ipoib_debugfs/ib0_mcg

Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0"
show anything on the other nodes when you try to ping or something?

Thanks,
  Roland

From roland at topspin.com Mon Nov 22 18:08:21 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 18:08:21 -0800
Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support
In-Reply-To: <20041122225033.GD15634@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 14:50:33 -0800")
References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com>
Message-ID: <52ekil9v1m.fsf@topspin.com>

    Greg> This could be in a sysfs file, right?

Ugh, how does one add an attribute (like the ABI version) to a
class_simple? It shouldn't be per-device but I don't see anything
like class_create_file() that could work for class_simple.

Thanks,
  Roland
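As the thread works out below, one way to get a class-wide attribute
is to manage the class directly instead of using class_simple. A
rough sketch against the 2.6-era driver core -- the class name
mirrors the infiniband_mad class mentioned later in the thread, and
the version value is a placeholder, not the real ABI number:

    #include <linux/kernel.h>
    #include <linux/device.h>

    static ssize_t show_abi_version(struct class *cls, char *buf)
    {
            return sprintf(buf, "%d\n", 1);  /* placeholder version */
    }
    static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);

    static struct class umad_class = {
            .name = "infiniband_mad",
    };

    static int __init umad_class_init(void)
    {
            int ret = class_register(&umad_class);
            if (ret)
                    return ret;
            /* class-wide file: /sys/class/infiniband_mad/abi_version */
            ret = class_create_file(&umad_class, &class_attr_abi_version);
            if (ret)
                    class_unregister(&umad_class);
            return ret;
    }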
From jjengla at sandia.gov Mon Nov 22 18:14:23 2004
From: jjengla at sandia.gov (Josh England)
Date: Mon, 22 Nov 2004 18:14:23 -0800
Subject: [openib-general] troubles with IPoIB
In-Reply-To: <52vfbx9wlw.fsf@topspin.com>
References: <1101173164.18604.53.camel@localhost> <52vfbx9wlw.fsf@topspin.com>
Message-ID: <1101176063.17750.58.camel@localhost>

On Mon, 2004-11-22 at 17:34 -0800, Roland Dreier wrote:
> What subnet manager are you using?

Embedded Voltaire SM.

> If you mount ipoib_debugfs
> somewhere (say /ipoib_debugfs), what do you get from
>     cat /ipoib_debugfs/ib0_mcg

How do I mount ipoib_debugfs? Is there some device associated with it
that I should be seeing? BTW...I'm using udev.

> Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0"
> show anything on the other nodes when you try to ping or something?

Nope...nothing's coming in...

-JE

From roland at topspin.com Mon Nov 22 18:27:36 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 18:27:36 -0800
Subject: [openib-general] troubles with IPoIB
In-Reply-To: <1101176063.17750.58.camel@localhost> (Josh England's message of "Mon, 22 Nov 2004 18:14:23 -0800")
References: <1101173164.18604.53.camel@localhost> <52vfbx9wlw.fsf@topspin.com> <1101176063.17750.58.camel@localhost>
Message-ID: <521xel9u5j.fsf@topspin.com>

    Josh> How do I mount ipoib_debugfs? Is there some device
    Josh> associated with it that I should be seeing? BTW...I'm using
    Josh> udev.

No device associated... just do

    mount -t ipoib_debugfs none /ipoib_debugfs/

for whatever value of /ipoib_debugfs/ you like. (You need to create
the directory first, just like any other mount point.)

udev should be no problem (all my dev systems use it too).

 - Roland

From roland at topspin.com Mon Nov 22 19:33:21 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 22 Nov 2004 19:33:21 -0800
Subject: [openib-general] [PATCH] Convert from pci_xxx to dma_xxx functions
Message-ID: <52wtwd8cji.fsf@topspin.com>

Christoph Hellwig suggested we might as well put in a generic struct
device *dma_device and use the generic dma_map functions rather than
assuming we're dealing with a PCI device. (There's no dma_xxx
equivalent of pci_unmap_addr_set() and friends, so I left that
stuff -- Christoph agrees this is OK for now.)

Look OK to commit?
Thanks, Roland Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1273) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -773,7 +773,7 @@ if (!priv) goto alloc_mem_failed; - SET_NETDEV_DEV(priv->dev, &hca->dma_device->dev); + SET_NETDEV_DEV(priv->dev, hca->dma_device); result = ib_query_pkey(hca, port, 0, &priv->pkey); if (result) { Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1272) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -107,9 +107,9 @@ } skb_reserve(skb, 4); /* 16 byte align IP header */ priv->rx_ring[id].skb = skb; - addr = pci_map_single(priv->ca->dma_device, + addr = dma_map_single(priv->ca->dma_device, skb->data, IPOIB_BUF_SIZE, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); ret = ipoib_ib_receive(priv, id, addr); @@ -154,11 +154,11 @@ priv->rx_ring[wr_id].skb = NULL; - pci_unmap_single(priv->ca->dma_device, + dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(&priv->rx_ring[wr_id], mapping), IPOIB_BUF_SIZE, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); if (wc->status != IB_WC_SUCCESS) { if (wc->status != IB_WC_WR_FLUSH_ERR) @@ -216,10 +216,10 @@ tx_req = &priv->tx_ring[wr_id]; - pci_unmap_single(priv->ca->dma_device, + dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(tx_req, mapping), tx_req->skb->len, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ++priv->stats.tx_packets; priv->stats.tx_bytes += tx_req->skb->len; @@ -318,9 +318,9 @@ */ tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; tx_req->skb = skb; - addr = pci_map_single(priv->ca->dma_device, + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); pci_unmap_addr_set(tx_req, mapping, addr); if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1272) +++ infiniband/include/ib_verbs.h (working copy) @@ -679,7 +679,7 @@ }; struct ib_device { - struct pci_dev *dma_device; + struct device *dma_device; char name[IB_DEVICE_NAME_MAX]; Index: infiniband/core/agent.c =================================================================== --- infiniband/core/agent.c (revision 1272) +++ infiniband/core/agent.c (working copy) @@ -23,11 +23,15 @@ Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
*/ +#include + +#include + #include + #include "smi.h" #include "agent_priv.h" #include "mad_priv.h" -#include spinlock_t ib_agent_port_list_lock; @@ -117,10 +121,10 @@ agent_send_wr->mad = mad; /* PCI mapping */ - gather_list.addr = pci_map_single(mad_agent->device->dma_device, + gather_list.addr = dma_map_single(mad_agent->device->dma_device, &mad->mad, sizeof(mad->mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof(mad->mad); gather_list.lkey = (*port_priv->mr).lkey; @@ -182,10 +186,10 @@ spin_lock_irqsave(&port_priv->send_list_lock, flags); if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - pci_unmap_single(mad_agent->device->dma_device, + dma_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), sizeof(mad->mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ib_destroy_ah(agent_send_wr->ah); kfree(agent_send_wr); } else { @@ -255,10 +259,10 @@ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Unmap PCI */ - pci_unmap_single(mad_agent->device->dma_device, + dma_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), sizeof(agent_send_wr->mad->mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ib_destroy_ah(agent_send_wr->ah); Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1272) +++ infiniband/core/user_mad.c (working copy) @@ -115,10 +115,10 @@ struct ib_umad_packet *packet = (void *) (unsigned long) send_wc->wr_id; - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); ib_destroy_ah(packet->ah); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { @@ -267,10 +267,10 @@ goto err_up; } - gather_list.addr = pci_map_single(agent->device->dma_device, + gather_list.addr = dma_map_single(agent->device->dma_device, packet->mad.data, sizeof packet->mad.data, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof packet->mad.data; gather_list.lkey = file->mr[packet->mad.id]->lkey; pci_unmap_addr_set(packet, mapping, gather_list.addr); @@ -285,10 +285,10 @@ ret = ib_post_send_mad(agent, &wr, &bad_wr); if (ret) { - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); goto err_up; } @@ -549,7 +549,7 @@ umad_dev->port[i - s].class_dev = class_simple_device_add(umad_class, umad_dev->port[i - s].dev.dev, - &device->dma_device->dev, + device->dma_device, "umad%d", umad_dev->port[i - s].devnum); if (IS_ERR(umad_dev->port[i - s].class_dev)) goto err_class; Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 1272) +++ infiniband/core/mad.c (working copy) @@ -53,16 +53,16 @@ * and/or other materials provided with the distribution. 
*/ +#include +#include #include + #include "mad_priv.h" #include "smi.h" #include "agent.h" -#include -#include - MODULE_LICENSE("Dual BSD/GPL"); MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); @@ -1094,11 +1094,11 @@ mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, mad_list); recv = container_of(mad_priv_hdr, struct ib_mad_private, header); - pci_unmap_single(port_priv->device->dma_device, + dma_unmap_single(port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); /* Setup MAD receive work completion from "normal" work completion */ recv->header.recv_wc.wc = wc; @@ -1627,12 +1627,12 @@ break; } } - sg_list.addr = pci_map_single(qp_info->port_priv-> + sg_list.addr = dma_map_single(qp_info->port_priv-> device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; mad_priv->header.mad_list.mad_queue = recv_queue; @@ -1648,12 +1648,12 @@ list_del(&mad_priv->header.mad_list.list); recv_queue->count--; spin_unlock_irqrestore(&recv_queue->lock, flags); - pci_unmap_single(qp_info->port_priv->device->dma_device, + dma_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); break; @@ -1686,11 +1686,11 @@ list_del(&mad_list->list); /* Undo PCI mapping */ - pci_unmap_single(qp_info->port_priv->device->dma_device, + dma_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), - PCI_DMA_FROMDEVICE); + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, recv); } Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1276) +++ infiniband/core/sa_query.c (working copy) @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -43,14 +44,14 @@ u16 attr_offset; u16 reserved; ib_sa_comp_mask comp_mask; -} __attribute__((packed)); +} __attribute__ ((packed)); struct ib_sa_mad { struct ib_mad_hdr mad_hdr; struct ib_rmpp_hdr rmpp_hdr; struct ib_sa_hdr sa_hdr; u8 data[200]; -} __attribute__((packed)); +} __attribute__ ((packed)); struct ib_sa_sm_ah { struct ib_ah *ah; @@ -460,20 +461,20 @@ wr.wr.ud.ah = port->sm_ah->ah; spin_unlock_irqrestore(&port->ah_lock, flags); - gather_list.addr = pci_map_single(port->agent->device->dma_device, + gather_list.addr = dma_map_single(port->agent->device->dma_device, query->mad, sizeof (struct ib_sa_mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof (struct ib_sa_mad); gather_list.lkey = port->mr->lkey; pci_unmap_addr_set(query, mapping, gather_list.addr); ret = ib_post_send_mad(port->agent, &wr, &bad_wr); if (ret) { - pci_unmap_single(port->agent->device->dma_device, + dma_unmap_single(port->agent->device->dma_device, pci_unmap_addr(query, mapping), sizeof (struct ib_sa_mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kref_put(&query->sm_ah->ref, free_sm_ah); spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, query->id); @@ -662,10 +663,10 @@ break; } - 
pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(query, mapping), sizeof (struct ib_sa_mad), - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kref_put(&query->sm_ah->ref, free_sm_ah); query->release(query); Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 1272) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -27,6 +27,7 @@ #include #include #include +#include #include #include Index: infiniband/hw/mthca/mthca_main.c =================================================================== --- infiniband/hw/mthca/mthca_main.c (revision 1272) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -28,7 +28,6 @@ #include #include #include -#include #ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL #include Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1272) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -573,7 +573,7 @@ strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; - dev->ib_dev.dma_device = dev->pdev; + dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; Index: infiniband/hw/mthca/mthca_mad.c =================================================================== --- infiniband/hw/mthca/mthca_mad.c (revision 1272) +++ infiniband/hw/mthca/mthca_mad.c (working copy) @@ -144,10 +144,10 @@ wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; wr.wr_id = (unsigned long) tmad; - gather_list.addr = pci_map_single(agent->device->dma_device, + gather_list.addr = dma_map_single(agent->device->dma_device, tmad->mad, sizeof *tmad->mad, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); gather_list.length = sizeof *tmad->mad; gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; pci_unmap_addr_set(tmad, mapping, gather_list.addr); @@ -167,10 +167,10 @@ spin_unlock_irqrestore(&dev->sm_lock, flags); if (ret) { - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(tmad, mapping), sizeof *tmad->mad, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kfree(tmad->mad); kfree(tmad); } @@ -259,10 +259,10 @@ struct mthca_trap_mad *tmad = (void *) (unsigned long) mad_send_wc->wr_id; - pci_unmap_single(agent->device->dma_device, + dma_unmap_single(agent->device->dma_device, pci_unmap_addr(tmad, mapping), sizeof *tmad->mad, - PCI_DMA_TODEVICE); + DMA_TO_DEVICE); kfree(tmad->mad); kfree(tmad); } From halr at voltaire.com Mon Nov 22 20:24:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Nov 2004 23:24:37 -0500 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101176063.17750.58.camel@localhost> References: <1101173164.18604.53.camel@localhost> <52vfbx9wlw.fsf@topspin.com> <1101176063.17750.58.camel@localhost> Message-ID: <1101183877.4124.545.camel@localhost.localdomain> On Mon, 2004-11-22 at 21:14, Josh England wrote: > On Mon, 2004-11-22 at 17:34 -0800, Roland Dreier wrote: > > What subnet manager are you using? > > Embedded Voltaire SM. I run this way all the time (with smaller configurations). Multicast works in general. 
-- Hal From halr at voltaire.com Mon Nov 22 20:26:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Nov 2004 23:26:18 -0500 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101173164.18604.53.camel@localhost> References: <1101173164.18604.53.camel@localhost> Message-ID: <1101183978.4124.548.camel@localhost.localdomain> Hi Josh, On Mon, 2004-11-22 at 20:26, Josh England wrote: > I've got an 85-node x86_64 PCIe cluster I'd like to run (and test) > openIB on. I've built a kernel using the latest patches from SVN, > loaded all the modules, and I see ACTIVE on the ports, but IPoIB does > not seem to want to work. What is the firmware version of the PCIe adapters ? I have seen problems like this when not all the adapters were at 4.5.3. You can get this via: cat /sys/class/infiniband/mthca0/fw_ver -- Hal From mlleinin at hpcn.ca.sandia.gov Mon Nov 22 21:15:38 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 22 Nov 2004 21:15:38 -0800 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101183978.4124.548.camel@localhost.localdomain> References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> Message-ID: <1101186939.29554.92.camel@trinity> On Mon, 2004-11-22 at 23:26 -0500, Hal Rosenstock wrote: > Hi Josh, > > On Mon, 2004-11-22 at 20:26, Josh England wrote: > > I've got an 85-node x86_64 PCIe cluster I'd like to run (and test) > > openIB on. I've built a kernel using the latest patches from SVN, > > loaded all the modules, and I see ACTIVE on the ports, but IPoIB does > > not seem to want to work. > > What is the firmware version of the PCIe adapters ? I have seen problems > like this when not all the adapters were at 4.5.3. > > You can get this via: > > cat /sys/class/infiniband/mthca0/fw_ver > We are using fw_ver 4.5.0. Looks like we need to upgrade. Time to try the user space firmware burning tools. - Matt From greg at kroah.com Mon Nov 22 22:30:45 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 22:30:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <52ekil9v1m.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> Message-ID: <20041123063045.GA22493@kroah.com> On Mon, Nov 22, 2004 at 06:08:21PM -0800, Roland Dreier wrote: > Greg> This could be in a sysfs file, right? > > Ugh, how does one add an attribute (like the ABI version) to a > class_simple? It shouldn't be per-device but I don't see anything > like class_create_file() that could work for class_simple. class_simple_device_add returns a pointer to a struct class_device * that you can then use to create a file in sysfs with. That should be what you're looking for. thanks, greg k-h From greg at kroah.com Mon Nov 22 22:41:20 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 22:41:20 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <527jodbgqo.fsf@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> Message-ID: <20041123064120.GB22493@kroah.com> On Mon, Nov 22, 2004 at 03:34:23PM -0800, Roland Dreier wrote: > Greg> No email address of who to bug with issues? > > There's a patch to MAINTAINERS... Yeah, but a name in each file is much nicer. 
> Greg> What is "RESERVED"? I must be missing a previous patch > Greg> somewhere, I currently don't see all of the series yet. > > It's in part 1/12: http://article.gmane.org/gmane.linux.kernel/257531 > unfortunately some people marked it as spam and it didn't get > everywhere. Thanks for pointing this out. One comment, the file drivers/infiniband/core/cache.c has a license that is illegal due to the contents of the file. Please change the license of the file to GPL only. Oh, and how about kernel-doc comments for all functions that are EXPORT_SYMBOL() marked? And for your core big structures? thanks, greg k-h From roland at topspin.com Mon Nov 22 22:45:02 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 22:45:02 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041123063045.GA22493@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 22:30:45 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> Message-ID: <52llct83o1.fsf@topspin.com> Greg> class_simple_device_add returns a pointer to a struct Greg> class_device * that you can then use to create a file in Greg> sysfs with. That should be what you're looking for. Shouldn't the ABI version be an attribute in /sys/class/infiniband_mad rather than being per-device? (I'm already creating several per-device attributes for the devices I get back from class_simple_device_add). - R. From greg at kroah.com Mon Nov 22 22:45:08 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 22:45:08 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122233047.GH27658@sventech.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> <20041122233047.GH27658@sventech.com> Message-ID: <20041123064508.GC22493@kroah.com> On Mon, Nov 22, 2004 at 03:30:47PM -0800, Johannes Erdfelt wrote: > On Mon, Nov 22, 2004, Greg KH wrote: > > On Mon, Nov 22, 2004 at 02:58:45PM -0800, Roland Dreier wrote: > > > Greg> Oh, have you asked for a real major number to be reserved > > > Greg> for umad? > > > > > > No, I think we're fine with a dynamic major. Is there any reason to > > > want a real major? > > > > People who do not use udev will not like you. > > I don't quite understand this. Given things like udev, wouldn't dynamic > majors work just like having a static major number? Yes, but people who do not use udev, will have a hard time creating the device nodes by hand every time. thanks, greg k-h From roland at topspin.com Mon Nov 22 22:47:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 22 Nov 2004 22:47:29 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041123064120.GB22493@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 22:41:20 -0800") References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> <20041123064120.GB22493@kroah.com> Message-ID: <52hdnh83jy.fsf@topspin.com> Greg> Yeah, but a name in each file is much nicer. Very little of the kernel seems to follow this rule right now. 
Greg> One comment, the file drivers/infiniband/core/cache.c has a Greg> license that is illegal due to the contents of the file. Greg> Please change the license of the file to GPL only. ?? Can you explain this? What makes that file special? Greg> Oh, and how about kernel-doc comments for all functions that Greg> are EXPORT_SYMBOL() marked? And for your core big Greg> structures? I guess we'll start working on it... - Roland From johannes at erdfelt.com Mon Nov 22 22:51:10 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Mon, 22 Nov 2004 22:51:10 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041123064508.GC22493@kroah.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> <20041122233047.GH27658@sventech.com> <20041123064508.GC22493@kroah.com> Message-ID: <20041123065110.GA3959@sventech.com> On Mon, Nov 22, 2004, Greg KH wrote: > On Mon, Nov 22, 2004 at 03:30:47PM -0800, Johannes Erdfelt wrote: > > On Mon, Nov 22, 2004, Greg KH wrote: > > > People who do not use udev will not like you. > > > > I don't quite understand this. Given things like udev, wouldn't dynamic > > majors work just like having a static major number? > > Yes, but people who do not use udev, will have a hard time creating the > device nodes by hand every time. Ok, I can understand that for now. Is the eventual plan to move to dynamic majors for all devices? JE From greg at kroah.com Mon Nov 22 23:29:45 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:29:45 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <52hdnh83jy.fsf@topspin.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> <20041123064120.GB22493@kroah.com> <52hdnh83jy.fsf@topspin.com> Message-ID: <20041123072944.GA22786@kroah.com> On Mon, Nov 22, 2004 at 10:47:29PM -0800, Roland Dreier wrote: > Greg> Yeah, but a name in each file is much nicer. > > Very little of the kernel seems to follow this rule right now. I agree, but it's good to add this for new files. > Greg> One comment, the file drivers/infiniband/core/cache.c has a > Greg> license that is illegal due to the contents of the file. > Greg> Please change the license of the file to GPL only. > > ?? Can you explain this? What makes that file special? You are using a specific data structure that is only licensed to be used in GPL code. By using it in code that has a non-GPL license (like the dual license you have) you are violating the license of that code, and open yourself up to lawsuits by the holder of that code. There, can I be vague enough? :) To be straightforward, either drop the RCU code completely, or change the license of your code. Hm, because of the fact that you are linking in GPL only code into this code (because of the .h files you are using) how could you ever expect to use a BSD-like license for this collected work? Aren't licenses fun... 
thanks, greg k-h From greg at kroah.com Mon Nov 22 23:38:27 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:38:27 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041123065110.GA3959@sventech.com> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122225335.GE15634@kroah.com> <52sm71bie2.fsf@topspin.com> <20041122230533.GB13083@kroah.com> <20041122233047.GH27658@sventech.com> <20041123064508.GC22493@kroah.com> <20041123065110.GA3959@sventech.com> Message-ID: <20041123073827.GA23122@kroah.com> On Mon, Nov 22, 2004 at 10:51:10PM -0800, Johannes Erdfelt wrote: > > Is the eventual plan to move to dynamic majors for all devices? No, some people will not allow that to happen, it would break too many old programs and configurations. It will probably be a config option if people wish to try it out (it's only about a 3 line change to the kernel to enable this, I need to just submit the patch one of these days...) thanks, greg k-h From greg at kroah.com Mon Nov 22 23:43:37 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:43:37 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <52llct83o1.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> <52llct83o1.fsf@topspin.com> Message-ID: <20041123074337.GB23194@kroah.com> On Mon, Nov 22, 2004 at 10:45:02PM -0800, Roland Dreier wrote: > Greg> class_simple_device_add returns a pointer to a struct > Greg> class_device * that you can then use to create a file in > Greg> sysfs with. That should be what you're looking for. > > Shouldn't the ABI version be an attribute in /sys/class/infiniband_mad > rather than being per-device? Yes, it probably should be. Hm, no, we don't allow you to put class specific files if you use the class_simple API, sorry I misread your question. You can just handle the class yourself and use the CLASS_ATTR() macro to define your api version function. thanks, greg k-h From greg at kroah.com Mon Nov 22 23:45:51 2004 From: greg at kroah.com (Greg KH) Date: Mon, 22 Nov 2004 23:45:51 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <52oehpbi2j.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52oehpbi2j.fsf@topspin.com> Message-ID: <20041123074551.GC23194@kroah.com> On Mon, Nov 22, 2004 at 03:05:40PM -0800, Roland Dreier wrote: > Greg> You are letting any user, with any privilege register or > Greg> unregister an "agent"? > > They have to be able to open the device node. We could add a check > that they have it open for writing but there's not really much point > in opening this device read-only. Ok, I remember this conversation a while ago. We discussed this same thing a number of months back on the openib mailing list. Nevermind :) > Greg> Also, these "agents" seem to be a type of filter, right? Is > Greg> there no other way to implement this than an ioctl? > > ioctl seems to be the least bad way to me. This really feels like a > legitimate use of ioctl to me -- we use read/write to handle passing > data through our file descriptor, and ioctl for control of the > properties of the descriptor. 
> > What would you suggest as an ioctl replacement? I really can't think of anything else. It just will require a _lot_ of vigilant attention to prevent people from adding other ioctls to this one, right? Do you have other ioctls planned for this same interface for stage 2 and future stages of ib implementation for Linux? thanks, greg k-h From ebiederman at lnxi.com Tue Nov 23 00:49:16 2004 From: ebiederman at lnxi.com (Eric W. Biederman) Date: 23 Nov 2004 01:49:16 -0700 Subject: [openib-general] Re: [PATCH][RFC/v1][11/12] Add InfiniBand Documentation files In-Reply-To: <20041122153144.GA4821@infradead.org> References: <20041122714.taTI3zcdWo5JfuMd@topspin.com> <20041122714.AyIOvRY195EGFTaO@topspin.com> <20041122153144.GA4821@infradead.org> Message-ID: Christoph Hellwig writes: > > + When the IPoIB driver is loaded, it creates one interface for each > > + port using the P_Key at index 0. To create an interface with a > > + different P_Key, write the desired P_Key into the main interface's > > + /sys/class/net//create_child file. For example: > > + > > + echo 0x8001 > /sys/class/net/ib0/create_child > > + > > + This will create an interface named ib0.8001 with P_Key 0x8001. To > > + remove a subinterface, use the "delete_child" file: > > + > > + echo 0x8001 > /sys/class/net/ib0/delete_child > > + > > + The P_Key for any interface is given by the "pkey" file, and the > > + main interface for a subinterface is in "parent." > > Any reason this doesn't use an interface similar to the normal vlan code? > > And what is a P_Key? IB version of a vlan identifier. Eric From arnd at arndb.de Tue Nov 23 04:13:57 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 23 Nov 2004 13:13:57 +0100 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> Message-ID: <200411231313.57758.arnd@arndb.de> On Maandag 22 November 2004 16:13, Roland Dreier wrote: > I'm very happy to be able to post an initial version of InfiniBand > patches for review. Patches 1, 3 and 5 didn't make it to lkml. Did you hit the 100kb size limit for mails? Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature URL: From roland at topspin.com Tue Nov 23 07:04:44 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:04:44 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041123074551.GC23194@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 23:45:51 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52oehpbi2j.fsf@topspin.com> <20041123074551.GC23194@kroah.com> Message-ID: <527joc8v3n.fsf@topspin.com> Greg> Do you have other ioctls planned for this same interface for Greg> stage 2 and future stages of ib implementation for Linux? Not that I know of. 
- Roland From roland at topspin.com Tue Nov 23 07:06:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:06:07 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <20041123074337.GB23194@kroah.com> (Greg KH's message of "Mon, 22 Nov 2004 23:43:37 -0800") References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> <52llct83o1.fsf@topspin.com> <20041123074337.GB23194@kroah.com> Message-ID: <523bz08v1c.fsf@topspin.com> Greg> Yes, it probably should be. Hm, no, we don't allow you to Greg> put class specific files if you use the class_simple API, Greg> sorry I misread your question. You can just handle the Greg> class yourself and use the CLASS_ATTR() macro to define your Greg> api version function. Ugh, then we end up duplicating the class_simple code. Would you accept a patch that adds class_simple_create_file()/class_simple_remove_file()? Thanks, Roland From greg at kroah.com Tue Nov 23 07:17:47 2004 From: greg at kroah.com (Greg KH) Date: Tue, 23 Nov 2004 07:17:47 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][9/12] Add InfiniBand userspace MAD support In-Reply-To: <523bz08v1c.fsf@topspin.com> References: <20041122714.nKCPmH9LMhT0X7WE@topspin.com> <20041122714.9zlcKGKvXlpga8EP@topspin.com> <20041122225033.GD15634@kroah.com> <52ekil9v1m.fsf@topspin.com> <20041123063045.GA22493@kroah.com> <52llct83o1.fsf@topspin.com> <20041123074337.GB23194@kroah.com> <523bz08v1c.fsf@topspin.com> Message-ID: <20041123151747.GA26986@kroah.com> On Tue, Nov 23, 2004 at 07:06:07AM -0800, Roland Dreier wrote: > Greg> Yes, it probably should be. Hm, no, we don't allow you to > Greg> put class specific files if you use the class_simple API, > Greg> sorry I misread your question. You can just handle the > Greg> class yourself and use the CLASS_ATTR() macro to define your > Greg> api version function. > > Ugh, then we end up duplicating the class_simple code. Would you > accept a patch that adds class_simple_create_file()/class_simple_remove_file()? Ick, ok, sure. Just make sure to mark them as EXPORT_SYMBOL_GPL() :) thanks, greg k-h From roland at topspin.com Tue Nov 23 07:20:35 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:20:35 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][0/12] Initial submission of InfiniBand patches for review In-Reply-To: <200411231313.57758.arnd@arndb.de> (Arnd Bergmann's message of "Tue, 23 Nov 2004 13:13:57 +0100") References: <20041122713.Nh0zRPbm8qA0VBxj@topspin.com> <200411231313.57758.arnd@arndb.de> Message-ID: <52vfbw7fss.fsf@topspin.com> Arnd> Patches 1, 3 and 5 didn't make it to lkml. Did you hit the Arnd> 100kb size limit for mails? Ah, that must be what happened. I was confused because gmane.org did pick them up, but I think that's because gmane is also subscribed to openib-general (which is cc'ed). I'll reroll the patches, splitting the too-large pieces, and send soon. 
Thanks, Roland From roland at topspin.com Tue Nov 23 07:39:03 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 07:39:03 -0800 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101186939.29554.92.camel@trinity> (Matt Leininger's message of "Mon, 22 Nov 2004 21:15:38 -0800") References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> <1101186939.29554.92.camel@trinity> Message-ID: <52llcs7ey0.fsf@topspin.com> Matt> We are using fw_ver 4.5.0. Looks like we need to upgrade. Matt> Time to try the user space firmware burning tools. I would recommend _not_ using tvflash to upgrade PCIe HCAs from FW 4.5.0 to 4.5.3 right now. The invariant sector of flash needs to be rewritten, and the version of tvflash checked in right now doesn't handle that properly yet. Give me a day or so to fix it... - Roland From rddunlap at osdl.org Tue Nov 23 07:27:05 2004 From: rddunlap at osdl.org (Randy.Dunlap) Date: Tue, 23 Nov 2004 07:27:05 -0800 Subject: [openib-general] Re: [PATCH][RFC/v1][4/12] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041123064120.GB22493@kroah.com> References: <20041122713.SDrx8l5Z4XR5FsjB@topspin.com> <20041122713.g6bh6aqdXIN4RJYR@topspin.com> <20041122222507.GB15634@kroah.com> <527jodbgqo.fsf@topspin.com> <20041123064120.GB22493@kroah.com> Message-ID: <41A356C9.40607@osdl.org> Greg KH wrote: > On Mon, Nov 22, 2004 at 03:34:23PM -0800, Roland Dreier wrote: > >> Greg> No email address of who to bug with issues? >> >>There's a patch to MAINTAINERS... > > > Yeah, but a name in each file is much nicer. I disagree. I'd rather be able to look in MAINTAINERS | CREDITS for all such references. -- ~Randy From roland at topspin.com Tue Nov 23 08:14:08 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:08 -0800 Subject: [openib-general] [PATCH][RFC/v2][0/21] Second submission of InfiniBand patches for review Message-ID: <20041123814.p0AnYzTlx42JeVes@topspin.com> Here is the second version of the InfiniBand driver patch set. These patches incorporate most but not all of the feedback received since Monday. However, the main reason for posting this new set is that several of the patches in the first batch ran afoul of the 100K limit on linux-kernel. This batch is split into smaller pieces so all the parts should make it through this time. Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Tue Nov 23 08:14:14 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:14 -0800 Subject: [openib-general] [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041123814.p0AnYzTlx42JeVes@topspin.com> Message-ID: <20041123814.rXLIXw020elfd6Da@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-11-23 08:10:15.790234096 -0800 @@ -0,0 +1,49 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ib_cache.h 1255 2004-11-17 17:20:41Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-11-23 08:10:15.847225692 -0800 @@ -0,0 +1,69 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_fmr_pool.h 696 2004-08-28 03:10:21Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +int ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params, + struct ib_fmr_pool **pool_handle); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-11-23 08:10:15.909216552 -0800 @@ -0,0 +1,241 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_pack.h 1051 2004-10-25 02:47:17Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. + * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. +*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + 
IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-11-23 08:10:15.974206969 -0800 @@ -0,0 +1,984 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id: ib_verbs.h 1226 2004-11-13 04:35:49Z roland $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_static_rate { + IB_STATIC_RATE_FULL = 0, + IB_STATIC_RATE_12X_TO_4X = 2, + IB_STATIC_RATE_4X_TO_1X = 3, + IB_STATIC_RATE_12X_TO_1X = 11 +}; + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + 
IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; +}; + +enum ib_device_modify_flags { + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). 
+ */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. + */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; 
+ u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr *mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct 
ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct 
kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +struct ib_pd *ib_alloc_pd(struct ib_device *device); +int ib_dealloc_pd(struct ib_pd *pd); + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); +int ib_destroy_ah(struct ib_ah *ah); + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +int ib_destroy_qp(struct ib_qp *qp); + +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +int ib_resize_cq(struct ib_cq *cq, int cqe); +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. 
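+ *
+ * A minimal consumer loop built on these return semantics might look
+ * like the following (sketch only; process_completion() is a
+ * placeholder for consumer code):
+ *
+ *	struct ib_wc wc;
+ *
+ *	while (ib_poll_cq(cq, 1, &wc) > 0)
+ *		process_completion(&wc);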
+ */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - request completion notification + * @cq:the CQ to generate an event for + * @cq_notify:%IB_CQ_SOLICITED for next solicited event, + * %IB_CQ_NEXT_COMP for any completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); +int ib_dereg_mr(struct ib_mr *mr); + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +int ib_dealloc_mw(struct ib_mw *mw); + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +int ib_unmap_fmr(struct list_head *fmr_list); +int ib_dealloc_fmr(struct ib_fmr *fmr); + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Tue Nov 23 08:14:19 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:19 -0800 Subject: [openib-general] [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123814.rXLIXw020elfd6Da@topspin.com> Message-ID: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:16.399144313 -0800 @@ -0,0 +1,11 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + default n + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. 
+ +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-11-23 08:10:16.436138859 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:16.496130013 -0800 @@ -0,0 +1,13 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += \ + ib_core.o + +ib_core-objs := \ + packer.o \ + ud_header.o \ + verbs.o \ + sysfs.o \ + device.o \ + fmr_pool.o \ + cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-11-23 08:10:16.816082837 -0800 @@ -0,0 +1,338 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: cache.c 1257 2004-11-17 23:12:18Z roland $ +*/ + +#include +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + struct rcu_head rcu; + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + struct rcu_head rcu; + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; +} + +static void rcu_free_pkey(struct rcu_head *head) +{ + struct ib_pkey_cache *cache = + container_of(head, struct ib_pkey_cache, rcu); + kfree(cache); +} + +static void rcu_free_gid(struct rcu_head *head) +{ + struct ib_gid_cache *cache = + container_of(head, struct ib_gid_cache, rcu); + kfree(cache); +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + rcu_read_lock(); + + cache = rcu_dereference(device->cache.gid_cache[port - start_port(device)]); + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + rcu_read_unlock(); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + rcu_read_lock(); + + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + rcu_read_unlock(); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + rcu_read_lock(); + + cache = rcu_dereference(device->cache.pkey_cache[port - start_port(device)]); + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + INIT_RCU_HEAD(&pkey_cache->rcu); + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + INIT_RCU_HEAD(&gid_cache->rcu); + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + rcu_assign_pointer(device->cache.pkey_cache[port - start_port(device)], + pkey_cache); + 
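+	/*
+	 * rcu_assign_pointer() publishes each fully initialized table, so
+	 * readers in ib_cached_gid_get()/ib_cached_pkey_get() never see a
+	 * half-built entry; the old tables are freed via call_rcu() below,
+	 * only after all current RCU readers have finished.
+	 */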
rcu_assign_pointer(device->cache.gid_cache [port - start_port(device)],
+			   gid_cache);
+
+	if (old_pkey_cache)
+		call_rcu(&old_pkey_cache->rcu, rcu_free_pkey);
+	if (old_gid_cache)
+		call_rcu(&old_gid_cache->rcu, rcu_free_gid);
+
+	kfree(tprops);
+	return;
+
+err:
+	kfree(pkey_cache);
+	kfree(gid_cache);
+	kfree(tprops);
+}
+
+static void ib_cache_task(void *work_ptr)
+{
+	struct ib_update_work *work = work_ptr;
+
+	ib_cache_update(work->device, work->port_num);
+	kfree(work);
+}
+
+static void ib_cache_event(struct ib_event_handler *handler,
+			   struct ib_event *event)
+{
+	struct ib_update_work *work;
+
+	if (event->event == IB_EVENT_PORT_ERR    ||
+	    event->event == IB_EVENT_PORT_ACTIVE ||
+	    event->event == IB_EVENT_LID_CHANGE  ||
+	    event->event == IB_EVENT_PKEY_CHANGE ||
+	    event->event == IB_EVENT_SM_CHANGE) {
+		work = kmalloc(sizeof *work, GFP_ATOMIC);
+		if (work) {
+			INIT_WORK(&work->work, ib_cache_task, work);
+			work->device   = event->device;
+			work->port_num = event->element.port_num;
+			schedule_work(&work->work);
+		}
+	}
+}
+
+void ib_cache_setup_one(struct ib_device *device)
+{
+	int p;
+
+	device->cache.pkey_cache =
+		kmalloc(sizeof *device->cache.pkey_cache *
+			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
+	device->cache.gid_cache =
+		kmalloc(sizeof *device->cache.gid_cache *
+			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
+
+	if (!device->cache.pkey_cache || !device->cache.gid_cache) {
+		printk(KERN_WARNING "Couldn't allocate cache "
+		       "for %s\n", device->name);
+		goto err;
+	}
+
+	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
+		device->cache.pkey_cache[p] = NULL;
+		device->cache.gid_cache [p] = NULL;
+		ib_cache_update(device, p + start_port(device));
+	}
+
+	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
+			      device, ib_cache_event);
+	if (ib_register_event_handler(&device->cache.event_handler))
+		goto err_cache;
+
+	return;
+
+err_cache:
+	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
+		kfree(device->cache.pkey_cache[p]);
+		kfree(device->cache.gid_cache[p]);
+	}
+
+err:
+	kfree(device->cache.pkey_cache);
+	kfree(device->cache.gid_cache);
+}
+
+void ib_cache_cleanup_one(struct ib_device *device)
+{
+	int p;
+
+	ib_unregister_event_handler(&device->cache.event_handler);
+	flush_scheduled_work();
+
+	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
+		kfree(device->cache.pkey_cache[p]);
+		kfree(device->cache.gid_cache[p]);
+	}
+
+	kfree(device->cache.pkey_cache);
+	kfree(device->cache.gid_cache);
+}
+
+struct ib_client cache_client = {
+	.name   = "cache",
+	.add    = ib_cache_setup_one,
+	.remove = ib_cache_cleanup_one
+};
+
+int __init ib_cache_setup(void)
+{
+	return ib_register_client(&cache_client);
+}
+
+void __exit ib_cache_cleanup(void)
+{
+	ib_unregister_client(&cache_client);
+}
+
+/*
+  Local Variables:
+  c-file-style: "linux"
+  indent-tabs-mode: t
+  End:
+*/
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/core/core_priv.h	2004-11-23 08:10:16.845078561 -0800
@@ -0,0 +1,48 @@
+/*
+  This software is available to you under a choice of one of two
+  licenses.  You may choose to be licensed under the terms of the GNU
+  General Public License (GPL) Version 2, available at
+  , or the OpenIB.org BSD
+  license, available in the LICENSE.TXT file accompanying this
+  software.  These details are also available at
+  .
+ + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: core_priv.h 1179 2004-11-09 05:04:42Z roland $ +*/ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-11-23 08:10:16.735094778 -0800 @@ -0,0 +1,462 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: device.c 1179 2004-11-09 05:04:42Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. 
+ */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + 
INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + 
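+	/*
+	 * event_handler_lock also serializes ib_dispatch_event()'s walk of
+	 * this list, so a handler cannot be invoked while it is being
+	 * added or removed.
+	 */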
list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-11-23 08:10:16.773089176 -0800 @@ -0,0 +1,470 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: fmr_pool.c 1082 2004-10-27 20:32:50Z roland $ +*/ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + If an FMR is not in use, then the list member will point to either + its pool's free_list (if the FMR can be mapped again; that is, + remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + FMR needs to be unmapped before being remapped). In either of these + cases it is a bug if the ref_count is not 0. In other words, if + ref_count is > 0, then the list member must not be linked into + either free_list or dirty_list. + + The cache_node member is used to link the FMR into a cache bucket + (if caching is enabled). This is independent of the reference count + of the FMR. When a valid FMR is released, its ref_count is + decremented, and if ref_count reaches 0, the FMR is placed in either + free_list or dirty_list as appropriate. However, it is not removed + from the cache and may be "revived" if a call to + ib_fmr_register_physical() occurs before the FMR is remapped. In + this case we just increment the ref_count and remove the FMR from + free_list/dirty_list. + + Before we remap an FMR from free_list, we remove it from the cache + (to prevent another user from obtaining a stale FMR). When an FMR + is released, we add it to the tail of the free list, so that our + cache eviction policy is "least recently used." + + All manipulation of ref_count, list and cache_node is protected by + pool_lock to maintain consistency. 
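+
+  A sketch of the consumer-side lifecycle this implies (illustrative
+  only; page_list/npages/io_addr belong to the caller, and the calls
+  are the pool API defined below):
+
+	fmr = ib_fmr_pool_map_phys(pool, page_list, npages, &io_addr);
+		(ref_count 0 -> 1; FMR comes from the cache or free_list)
+	... post work requests using fmr->fmr->lkey / fmr->fmr->rkey ...
+	ib_fmr_pool_unmap(fmr);
+		(ref_count 1 -> 0; FMR returns to free_list, or to
+		 dirty_list once remap_count reaches IB_FMR_MAX_REMAPS)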
+*/
+
+struct ib_fmr_pool {
+	spinlock_t                pool_lock;
+
+	int                       pool_size;
+	int                       max_pages;
+	int                       dirty_watermark;
+	int                       dirty_len;
+	struct list_head          free_list;
+	struct list_head          dirty_list;
+	struct hlist_head        *cache_bucket;
+
+	void                     (*flush_function)(struct ib_fmr_pool *pool,
+						   void              *arg);
+	void                     *flush_arg;
+
+	struct task_struct       *thread;
+
+	atomic_t                  req_ser;
+	atomic_t                  flush_ser;
+
+	wait_queue_head_t         force_wait;
+};
+
+static inline u32 ib_fmr_hash(u64 first_page)
+{
+	return jhash_2words((u32) first_page,
+			    (u32) (first_page >> 32),
+			    0);
+}
+
+/* Caller must hold pool_lock */
+static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool,
+						      u64 *page_list,
+						      int  page_list_len,
+						      u64  io_virtual_address)
+{
+	struct hlist_head *bucket;
+	struct ib_pool_fmr *fmr;
+	struct hlist_node *pos;
+
+	if (!pool->cache_bucket)
+		return NULL;
+
+	bucket = pool->cache_bucket + ib_fmr_hash(*page_list);
+
+	hlist_for_each_entry(fmr, pos, bucket, cache_node)
+		if (io_virtual_address == fmr->io_virtual_address &&
+		    page_list_len == fmr->page_list_len &&
+		    !memcmp(page_list, fmr->page_list,
+			    page_list_len * sizeof *page_list))
+			return fmr;
+
+	return NULL;
+}
+
+static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
+{
+	int ret;
+	struct ib_pool_fmr *fmr;
+	LIST_HEAD(unmap_list);
+	LIST_HEAD(fmr_list);
+
+	spin_lock_irq(&pool->pool_lock);
+
+	list_for_each_entry(fmr, &pool->dirty_list, list) {
+		hlist_del_init(&fmr->cache_node);
+		fmr->remap_count = 0;
+		list_add_tail(&fmr->fmr->list, &fmr_list);
+
+#ifdef DEBUG
+		if (fmr->ref_count != 0) {
+			printk(KERN_WARNING "Unmapping FMR %p with ref count %d\n",
+			       fmr, fmr->ref_count);
+		}
+#endif
+	}
+
+	list_splice(&pool->dirty_list, &unmap_list);
+	INIT_LIST_HEAD(&pool->dirty_list);
+	pool->dirty_len = 0;
+
+	spin_unlock_irq(&pool->pool_lock);
+
+	if (list_empty(&unmap_list)) {
+		return;
+	}
+
+	ret = ib_unmap_fmr(&fmr_list);
+	if (ret)
+		printk(KERN_WARNING "ib_unmap_fmr returned %d\n", ret);
+
+	spin_lock_irq(&pool->pool_lock);
+	list_splice(&unmap_list, &pool->free_list);
+	spin_unlock_irq(&pool->pool_lock);
+}
+
+static int ib_fmr_cleanup_thread(void *pool_ptr)
+{
+	struct ib_fmr_pool *pool = pool_ptr;
+
+	do {
+		if (pool->dirty_len >= pool->dirty_watermark ||
+		    atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) {
+			ib_fmr_batch_release(pool);
+
+			atomic_inc(&pool->flush_ser);
+			wake_up_interruptible(&pool->force_wait);
+
+			if (pool->flush_function)
+				pool->flush_function(pool, pool->flush_arg);
+		}
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (pool->dirty_len < pool->dirty_watermark &&
+		    atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 &&
+		    !kthread_should_stop())
+			schedule();
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+int ib_create_fmr_pool(struct ib_pd             *pd,
+		       struct ib_fmr_pool_param *params,
+		       struct ib_fmr_pool      **pool_handle)
+{
+	struct ib_device   *device;
+	struct ib_fmr_pool *pool;
+	int i;
+	int ret;
+
+	if (!params) {
+		return -EINVAL;
+	}
+
+	device = pd->device;
+	if (!device->alloc_fmr    ||
+	    !device->dealloc_fmr  ||
+	    !device->map_phys_fmr ||
+	    !device->unmap_fmr) {
+		printk(KERN_WARNING "Device %s does not support fast memory regions\n",
+		       device->name);
+		return -ENOSYS;
+	}
+
+	pool = kmalloc(sizeof *pool, GFP_KERNEL);
+	if (!pool) {
+		printk(KERN_WARNING "couldn't allocate pool struct\n");
+		return -ENOMEM;
+	}
+
+	pool->cache_bucket   = NULL;
+
+	pool->flush_function = params->flush_function;
+	pool->flush_arg      = params->flush_arg;
+
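+	/*
+	 * cache_bucket stays NULL unless params->cache is set below; a
+	 * NULL bucket table makes ib_fmr_cache_lookup() return early,
+	 * which is how caching is disabled for a pool.
+	 */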
INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + *pool_handle = pool; + return 0; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ret; + + out_fail: + ib_destroy_fmr_pool(pool); + *pool_handle = NULL; + + return -ENOMEM; +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* It's OK if someone else bumps req_ser again here -- we'll + just wait a little longer. 
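+	   The cleanup thread bumps flush_ser once per completed batch
+	   release, so the wait below returns only once a flush covering
+	   our request has finished.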
*/ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-11-23 08:10:16.560120578 -0800 @@ -0,0 +1,177 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: packer.c 1027 2004-10-20 03:59:00Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + 
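+			/*
+			 * shift is the distance from the field's least
+			 * significant bit to bit 0 of the big-endian word;
+			 * mask isolates the size_bits-wide field before it
+			 * is shifted down and stored in the structure.
+			 */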
u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-11-23 08:10:16.690101412 -0800 @@ -0,0 +1,684 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ *
+ * $Id: sysfs.c 1257 2004-11-17 23:12:18Z roland $
+ */
+
+#include "core_priv.h"
+
+#include 
+
+struct ib_port {
+	struct kobject         kobj;
+	struct ib_device      *ibdev;
+	struct attribute_group gid_group;
+	struct attribute     **gid_attr;
+	struct attribute_group pkey_group;
+	struct attribute     **pkey_attr;
+	u8                     port_num;
+};
+
+struct port_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf);
+	ssize_t (*store)(struct ib_port *, struct port_attribute *,
+			 const char *buf, size_t count);
+};
+
+#define PORT_ATTR(_name, _mode, _show, _store) \
+struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+#define PORT_ATTR_RO(_name) \
+struct port_attribute port_attr_##_name = __ATTR_RO(_name)
+
+struct port_table_attribute {
+	struct port_attribute attr;
+	int                   index;
+};
+
+static ssize_t port_attr_show(struct kobject *kobj,
+			      struct attribute *attr, char *buf)
+{
+	struct port_attribute *port_attr =
+		container_of(attr, struct port_attribute, attr);
+	struct ib_port *p = container_of(kobj, struct ib_port, kobj);
+
+	if (!port_attr->show)
+		return 0;
+
+	return port_attr->show(p, port_attr, buf);
+}
+
+static struct sysfs_ops port_sysfs_ops = {
+	.show = port_attr_show
+};
+
+static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
+			  char *buf)
+{
+	struct ib_port_attr attr;
+	ssize_t ret;
+
+	static const char *state_name[] = {
+		[IB_PORT_NOP]          = "NOP",
+		[IB_PORT_DOWN]         = "DOWN",
+		[IB_PORT_INIT]         = "INIT",
+		[IB_PORT_ARMED]        = "ARMED",
+		[IB_PORT_ACTIVE]       = "ACTIVE",
+		[IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER"
+	};
+
+	ret = ib_query_port(p->ibdev, p->port_num, &attr);
+	if (ret)
+		return ret;
+
+	return sprintf(buf, "%d: %s\n", attr.state,
+		       attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ?
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + 
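+	/*
+	 * The rest of this function builds a PerfMgmt PortCounters GET
+	 * MAD, hands it to the driver's process_mad hook, and decodes the
+	 * requested counter from the reply.  The attribute's index field
+	 * packs the whole query: bits 0-15 hold the counter's bit offset
+	 * within the PortCounters attribute, bits 16-23 its width in
+	 * bits, and bits 24 and up the counter number (see PORT_PMA_ATTR
+	 * above).
+	 */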
+	in_mad  = kmalloc(sizeof *in_mad,  GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+	if (!in_mad || !out_mad) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memset(in_mad, 0, sizeof *in_mad);
+	in_mad->mad_hdr.base_version  = 1;
+	in_mad->mad_hdr.mgmt_class    = IB_MGMT_CLASS_PERF_MGMT;
+	in_mad->mad_hdr.class_version = 1;
+	in_mad->mad_hdr.method        = IB_MGMT_METHOD_GET;
+	in_mad->mad_hdr.attr_id       = cpu_to_be16(0x12); /* PortCounters */
+
+	in_mad->data[41] = p->port_num;	/* PortSelect field */
+
+	if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff,
+				   in_mad, out_mad) &
+	     (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) !=
+	    (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	switch (width) {
+	case 4:
+		/* offset is a bit offset, so the nibble position within
+		 * the byte must be computed modulo 8, not modulo 4 */
+		ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >>
+					    (4 - (offset % 8))) & 0xf);
+		break;
+	case 8:
+		ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]);
+		break;
+	case 16:
+		ret = sprintf(buf, "%u\n",
+			      be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8)));
+		break;
+	case 32:
+		ret = sprintf(buf, "%u\n",
+			      be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8)));
+		break;
+	default:
+		ret = 0;
+	}
+
+out:
+	kfree(in_mad);
+	kfree(out_mad);
+
+	return ret;
+}
+
+static PORT_PMA_ATTR(symbol_error                    ,  0, 16,  32);
+static PORT_PMA_ATTR(link_error_recovery             ,  1,  8,  48);
+static PORT_PMA_ATTR(link_downed                     ,  2,  8,  56);
+static PORT_PMA_ATTR(port_rcv_errors                 ,  3, 16,  64);
+static PORT_PMA_ATTR(port_rcv_remote_physical_errors ,  4, 16,  80);
+static PORT_PMA_ATTR(port_rcv_switch_relay_errors    ,  5, 16,  96);
+static PORT_PMA_ATTR(port_xmit_discards              ,  6, 16, 112);
+static PORT_PMA_ATTR(port_xmit_constraint_errors     ,  7,  8, 128);
+static PORT_PMA_ATTR(port_rcv_constraint_errors      ,  8,  8, 136);
+static PORT_PMA_ATTR(local_link_integrity_errors     ,  9,  4, 152);
+static PORT_PMA_ATTR(excessive_buffer_overrun_errors , 10,  4, 156);
+static PORT_PMA_ATTR(VL15_dropped                    , 11, 16, 176);
+static PORT_PMA_ATTR(port_xmit_data                  , 12, 32, 192);
+static PORT_PMA_ATTR(port_rcv_data                   , 13, 32, 224);
+static PORT_PMA_ATTR(port_xmit_packets               , 14, 32, 256);
+static PORT_PMA_ATTR(port_rcv_packets                , 15, 32, 288);
+
+static struct attribute *pma_attrs[] = {
+	&port_pma_attr_symbol_error.attr.attr,
+	&port_pma_attr_link_error_recovery.attr.attr,
+	&port_pma_attr_link_downed.attr.attr,
+	&port_pma_attr_port_rcv_errors.attr.attr,
+	&port_pma_attr_port_rcv_remote_physical_errors.attr.attr,
+	&port_pma_attr_port_rcv_switch_relay_errors.attr.attr,
+	&port_pma_attr_port_xmit_discards.attr.attr,
+	&port_pma_attr_port_xmit_constraint_errors.attr.attr,
+	&port_pma_attr_port_rcv_constraint_errors.attr.attr,
+	&port_pma_attr_local_link_integrity_errors.attr.attr,
+	&port_pma_attr_excessive_buffer_overrun_errors.attr.attr,
+	&port_pma_attr_VL15_dropped.attr.attr,
+	&port_pma_attr_port_xmit_data.attr.attr,
+	&port_pma_attr_port_rcv_data.attr.attr,
+	&port_pma_attr_port_xmit_packets.attr.attr,
+	&port_pma_attr_port_rcv_packets.attr.attr,
+	NULL
+};
+
+static struct attribute_group pma_group = {
+	.name  = "counters",
+	.attrs = pma_attrs
+};
+
+static void ib_port_release(struct kobject *kobj)
+{
+	struct ib_port *p = container_of(kobj, struct ib_port, kobj);
+	struct attribute *a;
+	int i;
+
+	for (i = 0; (a = p->gid_attr[i]); ++i) {
+		kfree(a->name);
+		kfree(a);
+	}
+
+	for (i = 0; (a = p->pkey_attr[i]); ++i) {
+		kfree(a->name);
+		kfree(a);
+	}
+
+	kfree(p->gid_attr);
+	kfree(p->pkey_attr);	/* the pkey table itself must be freed too */
+	kfree(p);
+}
+
+static struct kobj_type port_type = {
+	.release    = ib_port_release,
+	.sysfs_ops  = &port_sysfs_ops,
+
.default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. + */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + 
kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + 
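+	/* Tear each port down in the same order as the error path of
+	 * ib_device_register_sysfs() above: attribute groups first, then
+	 * the kobject that backs them. */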
+ list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-11-23 08:10:16.600114681 -0800 @@ -0,0 +1,333 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ud_header.c 1027 2004-10-20 03:59:00Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { 
STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? 
IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 
2004-11-23 08:10:16.644108194 -0800 @@ -0,0 +1,420 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? 
+ ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? + qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + 
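+/*
+ * Optional verbs follow a common pattern in this file: if the
+ * low-level driver does not supply a method, the wrapper fails
+ * cleanly with -ENOSYS (or ERR_PTR(-ENOSYS)) instead of calling
+ * through a NULL function pointer, as ib_rereg_phys_mr() does below.
+ */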
+int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? + mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Tue Nov 23 08:14:26 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:26 -0800 Subject: [openib-general] [PATCH][RFC/v2][3/21] Hook up drivers/infiniband In-Reply-To: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Message-ID: <20041123814.LeHMD5hRZLn6VbLm@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. 
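The drivers/infiniband/Kconfig file being sourced is introduced elsewhere in
this series; roughly, it needs to declare the CONFIG_INFINIBAND symbol that
the Makefile hunk below tests.  A minimal sketch (the help text here is
illustrative, not taken from the series):

	menu "InfiniBand support"

	config INFINIBAND
		tristate "InfiniBand support"
		---help---
		  Core support for InfiniBand (IB).  Select this to build
		  the ib_core module and enable the hardware drivers and
		  upper-layer protocols under drivers/infiniband.

	endmenu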
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-11-23 08:09:54.858320443 -0800 +++ linux-bk/drivers/Kconfig 2004-11-23 08:10:17.410995118 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-11-23 08:10:06.504603238 -0800 +++ linux-bk/drivers/Makefile 2004-11-23 08:10:17.411994971 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Tue Nov 23 08:14:31 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:31 -0800 Subject: [openib-general] [PATCH][RFC/v2][4/21] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <20041123814.LeHMD5hRZLn6VbLm@topspin.com> Message-ID: <20041123814.xOcI2C4YpT1G9jQi@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-11-23 08:10:17.682955018 -0800 @@ -0,0 +1,334 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 cpu_to_be32(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent - MAD agent that sent the MAD. + * @mad_send_wc - Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent - MAD agent requesting the received MAD. + * @mad_recv_wc - Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to the user through this routine are owned by the receiving client. + */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device - Reference to device registration is on. + * @qp - Reference to QP used for sending and receiving MADs. + * @recv_handler - Callback handler for a received MAD. + * @send_handler - Callback handler for a sent MAD. + * @context - User-specified context associated with this registration. + * @hi_tid - Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num - Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id - Work request identifier associated with the send MAD request. 
+ * @status - Completion status.
+ * @vendor_err - Optional vendor error information returned with a failed
+ * request.
+ */
+struct ib_mad_send_wc {
+	u64			wr_id;
+	enum ib_wc_status	status;
+	u32			vendor_err;
+};
+
+/**
+ * ib_mad_recv_buf - received MAD buffer information.
+ * @list - Reference to next data buffer for a received RMPP MAD.
+ * @grh - References a data buffer containing the global route header.
+ * The data referenced by this buffer is only valid if the GRH is
+ * valid.
+ * @mad - References the start of the received MAD.
+ */
+struct ib_mad_recv_buf {
+	struct list_head	list;
+	struct ib_grh		*grh;
+	struct ib_mad		*mad;
+};
+
+/**
+ * ib_mad_recv_wc - received MAD information.
+ * @wc - Completion information for the received data.
+ * @recv_buf - Specifies the location of the received data buffer(s).
+ * @mad_len - The length of the received MAD, without duplicated headers.
+ *
+ * For a received response, the wr_id field of the wc is set to the wr_id
+ * for the corresponding send request.
+ */
+struct ib_mad_recv_wc {
+	struct ib_wc		*wc;
+	struct ib_mad_recv_buf	*recv_buf;
+	int			mad_len;
+};
+
+/**
+ * ib_mad_reg_req - MAD registration request
+ * @mgmt_class - Indicates which management class of MADs should be received
+ * by the caller.  This field is only required if the user wishes to
+ * receive unsolicited MADs, otherwise it should be 0.
+ * @mgmt_class_version - Indicates which version of MADs for the given
+ * management class to receive.
+ * @method_mask - The caller will receive unsolicited MADs for any method
+ * whose corresponding bit in @method_mask is set.
+ */
+struct ib_mad_reg_req {
+	u8	mgmt_class;
+	u8	mgmt_class_version;
+	DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS);
+};
+
+/**
+ * ib_register_mad_agent - Register to send/receive MADs.
+ * @device - The device to register with.
+ * @port_num - The port on the specified device to use.
+ * @qp_type - Specifies which QP to access.  Must be either
+ * IB_QPT_SMI or IB_QPT_GSI.
+ * @mad_reg_req - Specifies which unsolicited MADs should be received
+ * by the caller.  This parameter may be NULL if the caller only
+ * wishes to receive solicited responses.
+ * @rmpp_version - If set, indicates that the client will send
+ * and receive MADs that contain the RMPP header for the given version.
+ * If set to 0, indicates that RMPP is not used by this client.
+ * @send_handler - The completion callback routine invoked after a send
+ * request has completed.
+ * @recv_handler - The completion callback routine invoked for a received
+ * MAD.
+ * @context - User-specified context associated with the registration.
+ */
+struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
+					   u8 port_num,
+					   enum ib_qp_type qp_type,
+					   struct ib_mad_reg_req *mad_reg_req,
+					   u8 rmpp_version,
+					   ib_mad_send_handler send_handler,
+					   ib_mad_recv_handler recv_handler,
+					   void *context);
+
+/**
+ * ib_unregister_mad_agent - Unregisters a client from using MAD services.
+ * @mad_agent - Corresponding MAD registration request to deregister.
+ *
+ * After invoking this routine, MAD services are no longer usable by the
+ * client on the associated QP.
+ */
+int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent);
+
+/**
+ * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated
+ * with the registered client.
+ * @mad_agent - Specifies the associated registration to post the send to.
+ * @send_wr - Specifies the information needed to send the MAD(s).
+ * @bad_send_wr - Specifies the MAD on which an error was encountered.
+ * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc - Work completion information for a received MAD. + * @buf - User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc - Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. + * @mad_agent - Specifies the registration associated with sent MAD. + * @wr_id - Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp - Reference to a QP that requires MAD services. + * @rmpp_version - If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler - The completion callback routine invoked after a send + * request has completed. + * @recv_handler - The completion callback routine invoked for a received + * MAD. + * @context - User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent - Specifies the registered MAD service using the redirected QP. + * @wc - References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. 
+ */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-11-23 08:10:17.722949121 -0800 @@ -0,0 +1,67 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION cpu_to_be16(0x8000) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Tue Nov 23 08:14:40 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:40 -0800 Subject: [openib-general] [PATCH][RFC/v2][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <20041123814.xOcI2C4YpT1G9jQi@topspin.com> Message-ID: <20041123814.sBoIUxeLIDc9lo4V@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. 
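As a rough sketch of how a consumer would use the registration API declared
in ib_mad.h above -- the my_* names, the PerfMgmt/GSI choice, and the
assumption that the caller supplies a valid device and port are illustrative,
not code from this series:

	#include <ib_mad.h>

	static void my_send_handler(struct ib_mad_agent *agent,
				    struct ib_mad_send_wc *mad_send_wc)
	{
		/* mad_send_wc->wr_id identifies the completed send */
	}

	static void my_recv_handler(struct ib_mad_agent *agent,
				    struct ib_mad_recv_wc *mad_recv_wc)
	{
		/* ... examine mad_recv_wc->recv_buf->mad here ... */

		/* receive buffers are owned by the client and must be
		 * returned to the access layer */
		ib_free_recv_mad(mad_recv_wc);
	}

	static struct ib_mad_agent *my_open_perf_agent(struct ib_device *device,
						       u8 port_num)
	{
		struct ib_mad_reg_req req = {
			.mgmt_class         = IB_MGMT_CLASS_PERF_MGMT,
			.mgmt_class_version = 1,
		};

		/* deliver unsolicited GET MADs of this class to
		 * my_recv_handler */
		set_bit(IB_MGMT_METHOD_GET, req.method_mask);

		/* PerfMgmt MADs travel on QP1, hence IB_QPT_GSI;
		 * rmpp_version 0 means no RMPP.  Returns an agent or an
		 * ERR_PTR value. */
		return ib_register_mad_agent(device, port_num, IB_QPT_GSI,
					     &req, 0, my_send_handler,
					     my_recv_handler, NULL);
	}

Sends would then go through ib_post_send_mad() and may complete out of
posting order via the send handler; the agent itself is eventually torn down
with ib_unregister_mad_agent().
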
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-23 08:10:16.496130013 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:17.978911380 -0800 @@ -1,7 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include obj-$(CONFIG_INFINIBAND) += \ - ib_core.o + ib_core.o \ + ib_mad.o ib_core-objs := \ packer.o \ @@ -11,3 +12,8 @@ device.o \ fmr_pool.o \ cache.o + +ib_mad-objs := \ + mad.o \ + smi.o \ + agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-11-23 08:10:18.065898554 -0800 @@ -0,0 +1,390 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return 
smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad->mad, + sizeof(mad->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? */ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(struct ib_grh)); + } else { + ah_attr.ah_flags = 0; /* No GRH for SM class */ + } + } else { + /* Directed route or LID routed SM class */ + ah_attr.ah_flags = 0; /* No GRH */ + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { + send_wr.wr.ud.pkey_index = 0; /* Should only matter for GMPs */ + send_wr.wr.ud.remote_qkey = 0; /* for SMPs */ + } + send_wr.wr.ud.mad_hdr = &mad->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class 
in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + 
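+	/* All three MAD agents and the DMA MR are in place; publish the
+	 * port on the global list so agent_send() can find it. */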
spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-11-23 08:10:18.154885433 -0800 @@ -0,0 +1,42 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-11-23 08:10:18.178881895 -0800 @@ -0,0 +1,51 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . 
+ + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-11-23 08:10:18.021905041 -0800 @@ -0,0 +1,2109 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * Maintained by: vtrmaint1 at voltaire.com + * + * This program is intended for the purpose of Infiniband + * protocol stack for Linux Servers. + * + * This software program is free software and you are free to modifyi + * and/or redistribute it under a choice of one of the following two + * licenses: + * + * 1) under either the GNU General Public License (GPL) Version 2, June 1991, + * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory. + * This GPL license is also available from the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the + * web at http://www.fsf.org/copyleft/gpl.html + * + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, on the web at + * http://www.opensource.org/licenses/bsd-license.php. + * + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * + * + * To obtain a copy of these licenses, the source code to this software or + * for other questions, you may write to Voltaire, Inc., + * Attention: Voltaire openSource maintainer, + * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or + * by Email: vtrmaint1 at voltaire.com + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *priv); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static int solicited_mad(struct ib_mad *mad); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? 
+ 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + + if (rmpp_version) { + ret = ERR_PTR(-EINVAL); /* XXX: until RMPP implemented */ + goto error1; + } + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + if (!recv_handler) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + ret = ERR_PTR(-EINVAL); + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if (!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + class = port_priv->version[mad_reg_req->mgmt_class_version]; + if (class) { + mgmt_class = convert_mgmt_class( + mad_reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, mad_reg_req)) { + ret = ERR_PTR(-EINVAL); + goto error3; + } + } + } + } + + ret2 = add_mad_reg_req(mad_reg_req, 
mad_agent_priv); + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->work, timeout_sends, mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret; + + if (!smi_handle_dr_smp_send(smp, + mad_agent->device->node_type, + mad_agent->port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto error1; + } + if (smi_check_local_dr_smp(smp, + mad_agent->device, + mad_agent->port_num)) { + struct ib_mad_private *mad_priv; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wc mad_send_wc; + + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local " + "response MAD\n"); + goto error1; + } + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + if (mad_agent->device->process_mad) { + ret = mad_agent->device->process_mad( + mad_agent->device, + 0, + mad_agent->port_num, + smp->dr_slid, /* ? 
*/ + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) { + ret = 1; + goto error1; + } + if (ret & IB_MAD_RESULT_REPLY) { + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to + * complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = 0; /* IB_QPT_SMI ? */ + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = + &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent_priv->agent.recv_handler( + mad_agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + } + + if (mad_agent_priv->agent.send_handler) { + /* Now, complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent_priv->agent.send_handler( + mad_agent, + &mad_send_wc); + ret = 1; + } else + ret = -EINVAL; + } else + ret = 0; + +error1: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + 
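+		/*
+		 * The caller hands in a chain of UD work requests built
+		 * much like the one in agent_mad_send(); a minimal
+		 * sketch (field values are illustrative only):
+		 *
+		 *	wr.opcode            = IB_WR_SEND;
+		 *	wr.sg_list           = &sge;
+		 *	wr.num_sge           = 1;
+		 *	wr.send_flags        = IB_SEND_SIGNALED;
+		 *	wr.wr.ud.ah          = ah;
+		 *	wr.wr.ud.mad_hdr     = &mad->mad_hdr;
+		 *	wr.wr.ud.remote_qpn  = qpn;
+		 *	wr.wr.ud.remote_qkey = IB_QP1_QKEY;
+		 *	wr.wr.ud.timeout_ms  = 500;	/* expect a response */
+		 *	ret = ib_post_send_mad(mad_agent, &wr, &bad_wr);
+		 */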
/* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. + ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(entry, struct ib_mad_private_header, + recv_buf); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} 
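+/*
+ * Once RMPP is implemented, a reassembled receive may span multiple
+ * buffers chained through ib_mad_recv_buf.list (the list that
+ * ib_free_recv_mad() above already walks), and this routine will need
+ * to copy each segment into the caller's buffer -- roughly (sketch):
+ *
+ *	list_for_each_entry(seg, &mad_recv_wc->recv_buf->list, list) {
+ *		memcpy(buf + offset, seg->mad, segment_length);
+ *		offset += segment_length;
+ *	}
+ */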
+EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *priv) +{ + struct ib_mad_port_private *private; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + + int i, ret; + u8 mgmt_class; + + /* Make sure MAD registration request supplied */ + if (!mad_reg_req) + return 0; + + private = priv->qp_info->port_priv; + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + class = &private->version[mad_reg_req->mgmt_class_version]; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table for this class version */ + memset((*class)->method_table, 0, + sizeof((*class)->method_table)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = 
find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? + */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; + if (!class) { + printk(KERN_ERR PFX "No class table yet MAD registration " + "request supplied\n"); + goto out; + } + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[agent_priv->reg_req-> + mgmt_class_version]= NULL; + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *version; + struct ib_mad_mgmt_method_table *class; + + /* Routing is based on version, class, and method */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + version = port_priv->version[mad->mad_hdr.class_version]; + if (!version) + goto out; + class = version->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (class) + mad_agent = class->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + struct ib_smp *smp; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf = &recv->header.recv_buf; + recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; + recv->header.recv_buf.grh = &recv->grh; + + /* Validate MAD */ + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + smp = (struct ib_smp *)recv->header.recv_buf.mad; + if (!smi_handle_dr_smp_recv(smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(smp)) + goto local; + if (!smi_handle_dr_smp_send(smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + recv->header.recv_buf.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(recv->header.recv_buf.mad); + mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, + solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
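+		/*
+		 * refcount was set to 1 + (timeout > 0) when the send was
+		 * posted: one reference for the send completion and, when
+		 * a response is expected, one for the matching receive.
+		 * A failed send can never get a response, so its response
+		 * reference is dropped here in addition to the send
+		 * reference dropped below.
+		 */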
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup + */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = 
container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = 
find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + 
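+		/*
+		 * wr_id carries the address of the embedded mad_list
+		 * entry, so on completion the buffer is recovered the
+		 * same way the receive path does it:
+		 *
+		 *	mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id;
+		 *	mad_priv_hdr = container_of(mad_list,
+		 *				    struct ib_mad_private_header,
+		 *				    mad_list);
+		 *	recv = container_of(mad_priv_hdr,
+		 *			    struct ib_mad_private, header);
+		 *
+		 * The loop below keeps posting until the receive queue
+		 * reaches max_active or an allocation/post failure occurs.
+		 */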
+ /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get 
ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
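Flushing before destroying guarantees that any completion work already queued finishes before the QPs, MR, PD, and CQ below are released.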
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-11-23 08:10:18.221875555 -0800 @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. 
+ * Maintained by: vtrmaint1 at voltaire.com
+ *
+ * This program is intended to serve as part of the InfiniBand
+ * protocol stack for Linux servers.
+ *
+ * This software program is free software and you are free to modify
+ * and/or redistribute it under a choice of one of the following two
+ * licenses:
+ *
+ * 1) under the GNU General Public License (GPL) Version 2, June 1991,
+ * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory.
+ * This GPL license is also available from the Free Software Foundation,
+ * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the
+ * web at http://www.fsf.org/copyleft/gpl.html
+ *
+ * OR
+ *
+ * 2) under the terms of "The BSD License", a copy of which is in the file
+ * LICENSE2.txt in the root directory. The license is also available from
+ * the Open Source Initiative, on the web at
+ * http://www.opensource.org/licenses/bsd-license.php.
+ *
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ *
+ *
+ * To obtain a copy of these licenses, the source code to this software or
+ * for other questions, you may write to Voltaire, Inc.,
+ * Attention: Voltaire openSource maintainer,
+ * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or
+ * by Email: vtrmaint1 at voltaire.com
+ *
+ * Licensee has the right to choose either one of the above two licenses.
+ *
+ * Redistributions of source code must retain both the above copyright
+ * notice and either one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice and either one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 2048 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + struct ib_mad_recv_buf recv_buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct work; + unsigned long timeout; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + spinlock_t reg_lock; + struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 2004-11-23 08:10:18.110891920 -0800 @@ -0,0 +1,222 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + 
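/* (anything past hop_cnt + 1 is an unreasonable hop pointer, so the test below yields 0 and the SMP is dropped) */ +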
return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-11-23 08:10:18.259869953 -0800 @@ -0,0 +1,54 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Tue Nov 23 08:14:47 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:47 -0800 Subject: [openib-general] [PATCH][RFC/v2][6/21] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041123814.sBoIUxeLIDc9lo4V@topspin.com> Message-ID: <20041123814.UmUHBktptJzFvsrR@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-23 08:10:17.978911380 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:18.652812015 -0800 @@ -2,7 +2,8 @@ obj-$(CONFIG_INFINIBAND) += \ ib_core.o \ - ib_mad.o + ib_mad.o \ + ib_sa.o ib_core-objs := \ packer.o \ @@ -17,3 +18,5 @@ mad.o \ smi.o \ agent.o + +ib_sa-objs := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-11-23 08:10:18.678808182 -0800 @@ -0,0 +1,816 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { 
PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + 
.offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, 
+ sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_path_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_path_rec_callback; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf->mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf->mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
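+ * (If an event does arrive in that window, update_sm_ah() may simply run twice -- once from the handler's work item and once from the loop below -- which is harmless.)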
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + ib_set_client_data(device, &sa_client, sa_dev); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-11-23 08:10:18.729800663 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). 
+ * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, 
+ void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +static inline int +ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */

From roland at topspin.com Tue Nov 23 08:14:52 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:52 -0800 Subject: [openib-general] [PATCH][RFC/v2][7/21] Add Mellanox HCA low-level driver In-Reply-To: <20041123814.UmUHBktptJzFvsrR@topspin.com> Message-ID: <20041123814.y2QOtktHRf35o3M9@topspin.com>

Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.)

Signed-off-by: Roland Dreier

--- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-23 08:10:16.399144313 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:19.036755403 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-23 08:10:16.436138859 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-23 08:10:18.998761005 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-11-23 08:10:19.090747442 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions.
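(A note on the SSE doorbell option above: InfiniHost doorbell registers are 64 bits wide and must be hit with a single atomic store. On a 64-bit kernel one writeq suffices; on 32-bit x86 the driver otherwise has to take a spinlock and issue two 32-bit writes so concurrent doorbell rings cannot interleave. A minimal sketch of those two non-SSE paths -- illustrative only, the helper name doorbell_write is not the driver's:

#include <linux/types.h>
#include <linux/spinlock.h>
#include <asm/io.h>

static inline void doorbell_write(u32 hi, u32 lo, void __iomem *dest,
                                  spinlock_t *doorbell_lock)
{
#if BITS_PER_LONG == 64
        /* One atomic 64-bit store; no locking needed. */
        __raw_writeq(((u64) hi << 32) | lo, dest);
#else
        unsigned long flags;

        /* Two 32-bit stores; the lock keeps the halves of
         * concurrent doorbell rings from interleaving. */
        spin_lock_irqsave(doorbell_lock, flags);
        __raw_writel(hi, dest);
        __raw_writel(lo, dest + 4);
        spin_unlock_irqrestore(doorbell_lock, flags);
#endif
}

The SSE variant replaces the locked pair of stores with one 64-bit store through an XMM register, which is why the option is only offered for X86 && !X86_64 and why a kernel built with it needs a processor with SSE.)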
--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-11-23 08:10:19.146739186 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-11-23 08:10:19.197731667 -0800 @@ -0,0 +1,175 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 182 2004-05-21 22:19:11Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. 
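+ * (The CQ and QP tables in mthca_dev.h, for instance, carry their own spinlocks for this purpose.)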
+ */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-11-23 08:10:19.234726213 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_config_reg.h 182 2004-05-21 22:19:11Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-11-23 08:10:19.274720315 -0800 @@ -0,0 +1,387 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_dev.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; 
+ struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) \ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct 
mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-11-23 08:10:19.314714418 -0800 @@ -0,0 +1,119 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
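struct mthca_dev embeds its struct ib_device by value, so to_mdev() above can recover the container with container_of() and no lookup table is needed. A self-contained illustration of the pattern (the stub types and macro name here are stand-ins, not driver code):

#include <stddef.h>

#define ex_container_of(ptr, type, member) \
        ((type *) ((char *) (ptr) - offsetof(type, member)))

struct ib_device_stub { int dummy; };

struct mthca_dev_stub {
        struct ib_device_stub ib_dev;   /* embedded, as in struct mthca_dev */
        int hca_type;
};

static struct mthca_dev_stub *to_mdev_stub(struct ib_device_stub *ibdev)
{
        /* subtract the member offset to get back to the container */
        return ex_container_of(ibdev, struct mthca_dev_stub, ib_dev);
}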
+ * + * $Id: mthca_doorbell.h 1238 2004-11-15 21:58:14Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-11-23 08:10:19.352708816 -0800 @@ -0,0 +1,888 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
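The three mthca_write64() variants above exist because the HCA must observe a doorbell as a single 64-bit update; which variant gets compiled in is invisible to callers. A hypothetical ring of the send doorbell (the payload words are made up for illustration):

static void example_ring_send_doorbell(struct mthca_dev *dev, u32 nreq)
{
        u32 doorbell[2];

        doorbell[0] = cpu_to_be32(nreq);   /* illustrative payload only */
        doorbell[1] = cpu_to_be32(0);

        /* atomic on 64-bit or SSE builds, spinlock-protected otherwise */
        mthca_write64(doorbell, dev->kar + MTHCA_SEND_DOORBELL,
                      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
}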
+ * + * $Id: mthca_main.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DEV_LIM(mdev, &dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned 
status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + if (dev_lim.min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim.min_page_sz, PAGE_SIZE); + err = -ENODEV; + goto err_out_disable; + } + if (dev_lim.num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim.num_ports, MTHCA_MAX_PORTS); + err = -ENODEV; + goto err_out_disable; + } + + mdev->limits.num_ports = dev_lim.num_ports; + mdev->limits.vl_cap = dev_lim.max_vl; + mdev->limits.mtu_cap = dev_lim.max_mtu; + mdev->limits.gid_table_len = dev_lim.max_gids; + mdev->limits.pkey_table_len = dev_lim.max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim.local_ca_ack_delay; + mdev->limits.max_sg = dev_lim.max_sg; + mdev->limits.reserved_qps = dev_lim.reserved_qps; + mdev->limits.reserved_srqs = dev_lim.reserved_srqs; + mdev->limits.reserved_eecs = dev_lim.reserved_eecs; + mdev->limits.reserved_cqs = dev_lim.reserved_cqs; + mdev->limits.reserved_eqs = dev_lim.reserved_eqs; + mdev->limits.reserved_mtts = dev_lim.reserved_mtts; + mdev->limits.reserved_mrws = dev_lim.reserved_mrws; + mdev->limits.reserved_uars = dev_lim.reserved_uars; + mdev->limits.reserved_pds = dev_lim.reserved_pds; + + if (dev_lim.flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_close; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_sg; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) { + mdev->fw.arbel.mem[i].page = alloc_page(GFP_HIGHUSER); + mdev->fw.arbel.mem[i].length = PAGE_SIZE; + if (!mdev->fw.arbel.mem[i].page) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x,
aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + return -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto 
err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. + */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + __free_page(mdev->fw.arbel.mem[i].page); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + 
printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:14:58 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:14:58 -0800 Subject: [openib-general] [PATCH][RFC/v2][8/21] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <20041123814.y2QOtktHRf35o3M9@topspin.com> Message-ID: <20041123814.Yu9sv2vgFBLAV3pZ@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-11-23 08:10:19.734652499 -0800 @@ -0,0 +1,629 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
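The ID table above lists each chip twice, once per vendor ID (Mellanox- and Topspin-branded boards share silicon), with .driver_data carrying the HCA type so the probe path can branch on it. A minimal sketch of consuming that field (helper name hypothetical; the enum values come from mthca_dev.h):

static const char *example_hca_name(const struct pci_device_id *id)
{
        /* driver_data was filled in from the PCI ID table */
        switch (id->driver_data) {
        case TAVOR:        return "MT23108";
        case ARBEL_COMPAT: return "MT25208 (Tavor compat mode)";
        case ARBEL_NATIVE: return "MT25208";
        default:           return "unknown";
        }
}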
+ * + * $Id: mthca_provider.c 1169 2004-11-08 17:23:45Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
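	/*
	 * (Worked example of the block arithmetic used a few lines
	 * below: P_Keys are fetched 32 to a MAD, so attr_mod selects
	 * block index / 32 and the reply is indexed with index % 32;
	 * e.g. index 40 reads block 1, slot 8.)
	 */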
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(page_list); + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { +
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-11-23 08:10:19.785644981 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. 
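The three class attributes registered above surface as read-only sysfs files. A user-space sketch of reading one back (the path is an assumption based on the class device naming, not confirmed by this patch):

#include <stdio.h>

int main(void)
{
        char buf[64];
        FILE *f = fopen("/sys/class/infiniband/mthca0/hca_type", "r");

        if (f && fgets(buf, sizeof buf, f))
                printf("HCA type: %s", buf);
        if (f)
                fclose(f);
        return 0;
}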
All rights reserved. + * + * $Id: mthca_provider.h 996 2004-10-14 05:47:49Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibility to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized.
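 *
 * In code form, the completion-event rules above amount to the
 * following (simplified sketch, not the driver's exact functions):
 *
 *	spin_lock(&dev->cq_table.lock);
 *	cq = mthca_array_get(&dev->cq_table.cq, cqn);
 *	if (cq)
 *		atomic_inc(&cq->refcount);
 *	spin_unlock(&dev->cq_table.lock);
 *	... handle the event under cq->lock ...
 *	if (atomic_dec_and_test(&cq->refcount))
 *		wake_up(&cq->wait);
 *
 * while destroy drops the table's reference and then sleeps with
 * wait_event(cq->wait, !atomic_read(&cq->refcount)).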
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:07 -0800 Subject: [openib-general] [PATCH][RFC/v2][9/21] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <20041123814.Yu9sv2vgFBLAV3pZ@topspin.com> Message-ID: <20041123815.4PYKXCiYMYCttxq4@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-11-23 08:10:20.044606797 -0800 @@ -0,0 +1,1522 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
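Given the wqe_shift and queue-union layout above, locating entry n of a work queue is a shift plus, for indirectly allocated queues, a page lookup. A sketch of the addressing pattern the posting code relies on (function name hypothetical):

static void *example_get_recv_wqe(struct mthca_qp *qp, int n)
{
        if (qp->is_direct)
                return qp->queue.direct.buf + (n << qp->rq.wqe_shift);

        /* indirect: power-of-two-sized WQEs don't straddle pages */
        return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf +
               ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1));
}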
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.c 1229 2004-11-15 04:50:35Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
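+ *
+ * For reference, at HZ=1000 the disabled jiffy-based classes below
+ * would come to 2 jiffies (class A), 11 jiffies (class B) and 101
+ * jiffies (class C): e.g. (HZ + 9) / 10 + 1 = 1009 / 10 + 1 = 101.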
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* + * Flush posted writes so GO bit is written last (needed with + * __raw_writel, which may not order writes). + */ + readl(dev->hcr + HCR_STATUS_OFFSET); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = readb(dev->hcr + HCR_STATUS_OFFSET); + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
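+ *
+ * For example, mthca_MGID_HASH() below retrieves its 16-bit hash
+ * through this path:
+ *
+ *	u64 imm;
+ *	err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH,
+ *			    CMD_TIME_CLASS_A, status);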
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
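+ *
+ * For example, an entry with DMA address 0x230000 and length
+ * 0x10000 gives ffs(0x230000 | 0x10000) - 1 = 16, so it is passed
+ * to the FW in 64KB chunks; anything with lg < 12 (alignment below
+ * 4K) is rejected.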
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more signifant bits than minor + * version, so swap here. 
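+ *
+ * For example, a raw value of 0x000300020001 (major 3, subminor 2,
+ * minor 1) becomes 0x000300010002, which prints as FW version
+ * 3.1.2.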
+ */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define 
QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->max_avs = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 
4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = 
pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, 
param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0;
+	flags |= param->vl_cap << INIT_IB_VL_SHIFT;
+	flags |= param->mtu_cap << INIT_IB_MTU_SHIFT;
+	MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET);
+
+	MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET);
+	MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET);
+	MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET);
+	MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET);
+	MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET);
+
+	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
+			CMD_TIME_CLASS_A, status);
+
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
+	return err;
+}
+
+int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status)
+{
+	return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status);
+}
+
+int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status);
+}
+
+int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry,
+		    int mpt_index, u8 *status)
+{
+	dma_addr_t indma;
+	int err;
+
+	indma = pci_map_single(dev->pdev, mpt_entry,
+			       MTHCA_MPT_ENTRY_SIZE,
+			       PCI_DMA_TODEVICE);
+	if (pci_dma_mapping_error(indma))
+		return -ENOMEM;
+
+	err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT,
+			CMD_TIME_CLASS_B, status);
+
+	pci_unmap_single(dev->pdev, indma,
+			 MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE);
+	return err;
+}
+
+int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry,
+		    int mpt_index, u8 *status)
+{
+	dma_addr_t outdma = 0;
+	int err;
+
+	if (mpt_entry) {
+		outdma = pci_map_single(dev->pdev, mpt_entry,
+					MTHCA_MPT_ENTRY_SIZE,
+					PCI_DMA_FROMDEVICE);
+		if (pci_dma_mapping_error(outdma))
+			return -ENOMEM;
+	}
+
+	err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry,
+			    CMD_HW2SW_MPT,
+			    CMD_TIME_CLASS_B, status);
+
+	if (mpt_entry)
+		pci_unmap_single(dev->pdev, outdma,
+				 MTHCA_MPT_ENTRY_SIZE,
+				 PCI_DMA_FROMDEVICE);
+	return err;
+}
+
+int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry,
+		    int num_mtt, u8 *status)
+{
+	dma_addr_t indma;
+	int err;
+
+	indma = pci_map_single(dev->pdev, mtt_entry,
+			       (num_mtt + 2) * 8,
+			       PCI_DMA_TODEVICE);
+	if (pci_dma_mapping_error(indma))
+		return -ENOMEM;
+
+	err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT,
+			CMD_TIME_CLASS_B, status);
+
+	pci_unmap_single(dev->pdev, indma,
+			 (num_mtt + 2) * 8, PCI_DMA_TODEVICE);
+	return err;
+}
+
+int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap,
+		 int eq_num, u8 *status)
+{
+	mthca_dbg(dev, "%s mask %016llx for eqn %d\n",
+		  unmap ?
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
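+	/* the u16 assignment above keeps only the low 16 bits of the
+	 * 64-bit immediate result returned by the MGID_HASH command */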
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-11-23 08:10:20.076602080 -0800 @@ -0,0 +1,260 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.h 1229 2004-11-15 04:50:35Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocaterd resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int 
max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_avs; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void 
*cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) x, MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:14 -0800 Subject: [openib-general] [PATCH][RFC/v2][10/21] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <20041123815.4PYKXCiYMYCttxq4@topspin.com> Message-ID: <20041123815.Ai338wEt3YqtY107@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-11-23 08:10:20.359560358 -0800 @@ -0,0 +1,650 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_eq.c 887 2004-09-25 16:16:56Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_EQ_OVERFLOW) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + int work = 0; + + while (1) { + if (!next_eqe_sw(eq)) + break; + + eqe = get_eqe(eq, eq->cons_index); + work = 1; + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case 
MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + } + + if (work) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } + + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + 
kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr,
+			      &dev->eq_table.eq[MTHCA_EQ_CMD]);
+	if (err)
+		goto err_out_async;
+
+	if (dev->mthca_flags & MTHCA_FLAG_MSI_X) {
+		static const char *eq_name[] = {
+			[MTHCA_EQ_COMP]  = DRV_NAME " (comp)",
+			[MTHCA_EQ_ASYNC] = DRV_NAME " (async)",
+			[MTHCA_EQ_CMD]   = DRV_NAME " (cmd)"
+		};
+
+		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
+			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
+					  mthca_msi_x_interrupt, 0,
+					  eq_name[i], dev->eq_table.eq + i);
+			if (err)
+				goto err_out_cmd;
+			dev->eq_table.eq[i].have_irq = 1;
+		}
+	} else {
+		err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ,
+				  DRV_NAME, dev);
+		if (err)
+			goto err_out_cmd;
+		dev->eq_table.have_irq = 1;
+	}
+
+	err = mthca_MAP_EQ(dev, async_mask(dev),
+			   0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status);
+	if (err)
+		mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n",
+			   dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err);
+	if (status)
+		mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n",
+			   dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status);
+
+	err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK,
+			   0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status);
+	if (err)
+		mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n",
+			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err);
+	if (status)
+		mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n",
+			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status);
+
+	return 0;
+
+err_out_cmd:
+	mthca_free_irqs(dev);
+	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]);
+
+err_out_async:
+	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]);
+
+err_out_comp:
+	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]);
+
+err_out_free:
+	mthca_alloc_cleanup(&dev->eq_table.alloc);
+	return err;
+}
+
+void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev)
+{
+	u8 status;
+	int i;
+
+	mthca_free_irqs(dev);
+
+	mthca_MAP_EQ(dev, async_mask(dev),
+		     1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status);
+	mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK,
+		     1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status);
+
+	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+		mthca_free_eq(dev, &dev->eq_table.eq[i]);
+
+	mthca_alloc_cleanup(&dev->eq_table.alloc);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */

From roland at topspin.com  Tue Nov 23 08:15:20 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 23 Nov 2004 08:15:20 -0800
Subject: [openib-general] [PATCH][RFC/v2][11/21] Add Mellanox HCA low-level driver (initialization)
In-Reply-To: <20041123815.Ai338wEt3YqtY107@topspin.com>
Message-ID: <20041123815.dUhm1PnERtccLLnp@topspin.com>

Add device initialization code for the Mellanox HCA driver.

Signed-off-by: Roland Dreier <roland@topspin.com>
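[Editor's note: for orientation, here is a rough sketch of the bring-up
sequence this patch implements. mthca_reset() and mthca_make_profile() are
the functions added below; the glue function name and the query/INIT_HCA
steps are assumptions based on the structures they fill, not code from the
patch.]

	/* Hypothetical glue, for illustration only -- not part of the patch. */
	static int mthca_bringup_sketch(struct mthca_dev *mdev)
	{
		struct mthca_dev_lim dev_lim;          /* device limits, filled by a firmware query (assumed) */
		struct mthca_init_hca_param init_hca;  /* parameters for the INIT_HCA command (assumed) */
		int err;

		err = mthca_reset(mdev);               /* mthca_reset.c, below */
		if (err)
			return err;

		/* ... query firmware and device limits into dev_lim (assumed) ... */

		err = mthca_make_profile(mdev, &dev_lim, &init_hca); /* mthca_profile.c, below */
		if (err)
			return err;

		/* ... hand init_hca to the INIT_HCA firmware command (assumed) ... */
		return 0;
	}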
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c	2004-11-23 08:10:20.600524828 -0800
@@ -0,0 +1,222 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_profile.c 1239 2004-11-15 23:14:21Z roland $
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+#include "mthca_profile.h"
+
+static int default_profile[MTHCA_RES_NUM] = {
+	[MTHCA_RES_QP]   = 1 << 16,
+	[MTHCA_RES_EQP]  = 1 << 16,
+	[MTHCA_RES_CQ]   = 1 << 16,
+	[MTHCA_RES_EQ]   = 32,
+	[MTHCA_RES_RDB]  = 1 << 18,
+	[MTHCA_RES_MCG]  = 1 << 13,
+	[MTHCA_RES_MPT]  = 1 << 17,
+	[MTHCA_RES_MTT]  = 1 << 20,
+	[MTHCA_RES_UDAV] = 1 << 15
+};
+
+enum {
+	MTHCA_RDB_ENTRY_SIZE = 32,
+	MTHCA_MTT_SEG_SIZE   = 64
+};
+
+enum {
+	MTHCA_NUM_PDS = 1 << 15
+};
+
+int mthca_make_profile(struct mthca_dev *dev,
+		       struct mthca_dev_lim *dev_lim,
+		       struct mthca_init_hca_param *init_hca)
+{
+	/* just use default profile for now */
+	struct mthca_resource {
+		u64 size;
+		u64 start;
+		int type;
+		int num;
+		int log_num;
+	};
+
+	u64 total_size = 0;
+	struct mthca_resource *profile;
+	struct mthca_resource tmp;
+	int i, j;
+
+	default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE;
+
+	profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL);
+	if (!profile)
+		return -ENOMEM;
+
+	profile[MTHCA_RES_QP].size   = dev_lim->qpc_entry_sz;
+	profile[MTHCA_RES_EEC].size  = dev_lim->eec_entry_sz;
+	profile[MTHCA_RES_SRQ].size  = dev_lim->srq_entry_sz;
+	profile[MTHCA_RES_CQ].size   = dev_lim->cqc_entry_sz;
+	profile[MTHCA_RES_EQP].size  = dev_lim->eqpc_entry_sz;
+	profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz;
+	profile[MTHCA_RES_EQ].size   = dev_lim->eqc_entry_sz;
+	profile[MTHCA_RES_RDB].size  = MTHCA_RDB_ENTRY_SIZE;
+	profile[MTHCA_RES_MCG].size  = MTHCA_MGM_ENTRY_SIZE;
+	profile[MTHCA_RES_MPT].size  = MTHCA_MPT_ENTRY_SIZE;
+	profile[MTHCA_RES_MTT].size  = MTHCA_MTT_SEG_SIZE;
+	profile[MTHCA_RES_UAR].size  = dev_lim->uar_scratch_entry_sz;
+	profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE;
+
+	for (i = 0; i < MTHCA_RES_NUM; ++i) {
+		profile[i].type     = i;
+		profile[i].num      = default_profile[i];
+		profile[i].log_num  = max(ffs(default_profile[i]) - 1, 0);
+		profile[i].size    *= default_profile[i];
+	}
+
+	/*
+	 * Sort the resources in decreasing order of size.  Since they
+	 * all have sizes that are powers of 2, we'll be able to keep
+	 * resources aligned to their size and pack them without gaps
+	 * using the sorted order.
+	 */
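[Editor's note: the alignment claim in the comment above can be checked with
a small standalone program. Because every size is a power of two and the list
is sorted in decreasing order, the running offset is always a multiple of the
next size to be placed. Illustration only, not part of the patch:]

	#include <assert.h>

	int main(void)
	{
		/* Power-of-two sizes in decreasing order, as after the sort below. */
		unsigned long sizes[] = { 1UL << 20, 1UL << 18, 1UL << 16, 1UL << 13 };
		unsigned long offset = 0;
		int i;

		for (i = 0; i < 4; ++i) {
			assert(offset % sizes[i] == 0);	/* each region is naturally aligned */
			offset += sizes[i];
		}
		return 0;
	}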
+	for (i = MTHCA_RES_NUM; i > 0; --i)
+		for (j = 1; j < i; ++j) {
+			if (profile[j].size > profile[j - 1].size) {
+				tmp            = profile[j];
+				profile[j]     = profile[j - 1];
+				profile[j - 1] = tmp;
+			}
+		}
+
+	for (i = 0; i < MTHCA_RES_NUM; ++i) {
+		if (profile[i].size) {
+			profile[i].start = dev->ddr_start + total_size;
+			total_size      += profile[i].size;
+		}
+		if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) {
+			mthca_err(dev, "Profile requires 0x%llx bytes; "
+				  "won't fit between DDR start at 0x%016llx "
+				  "and FW start at 0x%016llx.\n",
+				  (unsigned long long) total_size,
+				  (unsigned long long) dev->ddr_start,
+				  (unsigned long long) dev->fw.tavor.fw_start);
+			kfree(profile);
+			return -ENOMEM;
+		}
+
+		if (profile[i].size)
+			mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx "
+				  "(size 0x%8llx)\n",
+				  i, profile[i].type, profile[i].log_num,
+				  (unsigned long long) profile[i].start,
+				  (unsigned long long) profile[i].size);
+	}
+
+	mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n",
+		  (int) (total_size >> 10),
+		  (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10),
+		  (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10));
+
+	for (i = 0; i < MTHCA_RES_NUM; ++i) {
+		switch (profile[i].type) {
+		case MTHCA_RES_QP:
+			dev->limits.num_qps   = profile[i].num;
+			init_hca->qpc_base    = profile[i].start;
+			init_hca->log_num_qps = profile[i].log_num;
+			break;
+		case MTHCA_RES_EEC:
+			dev->limits.num_eecs   = profile[i].num;
+			init_hca->eec_base     = profile[i].start;
+			init_hca->log_num_eecs = profile[i].log_num;
+			break;
+		case MTHCA_RES_SRQ:
+			dev->limits.num_srqs   = profile[i].num;
+			init_hca->srqc_base    = profile[i].start;
+			init_hca->log_num_srqs = profile[i].log_num;
+			break;
+		case MTHCA_RES_CQ:
+			dev->limits.num_cqs   = profile[i].num;
+			init_hca->cqc_base    = profile[i].start;
+			init_hca->log_num_cqs = profile[i].log_num;
+			break;
+		case MTHCA_RES_EQP:
+			init_hca->eqpc_base = profile[i].start;
+			break;
+		case MTHCA_RES_EEEC:
+			init_hca->eeec_base = profile[i].start;
+			break;
+		case MTHCA_RES_EQ:
+			dev->limits.num_eqs   = profile[i].num;
+			init_hca->eqc_base    = profile[i].start;
+			init_hca->log_num_eqs = profile[i].log_num;
+			break;
+		case MTHCA_RES_RDB:
+			dev->limits.num_rdbs = profile[i].num;
+			init_hca->rdb_base   = profile[i].start;
+			break;
+		case MTHCA_RES_MCG:
+			dev->limits.num_mgms      = profile[i].num >> 1;
+			dev->limits.num_amgms     = profile[i].num >> 1;
+			init_hca->mc_base         = profile[i].start;
+			init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1;
+			init_hca->log_mc_table_sz = profile[i].log_num;
+			init_hca->mc_hash_sz      = 1 << (profile[i].log_num - 1);
+			break;
+		case MTHCA_RES_MPT:
+			dev->limits.num_mpts = profile[i].num;
+			init_hca->mpt_base   = profile[i].start;
+			init_hca->log_mpt_sz = profile[i].log_num;
+			break;
+		case MTHCA_RES_MTT:
+			dev->limits.num_mtt_segs = profile[i].num;
+			dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE;
+			dev->mr_table.mtt_base   = profile[i].start;
+			init_hca->mtt_base       = profile[i].start;
+			init_hca->mtt_seg_sz     = ffs(MTHCA_MTT_SEG_SIZE) - 7;
+			break;
+		case MTHCA_RES_UAR:
+			init_hca->uar_scratch_base = profile[i].start;
+			break;
+		case MTHCA_RES_UDAV:
+			dev->av_table.ddr_av_base = profile[i].start;
+			dev->av_table.num_ddr_avs = profile[i].num;
+			break;
+		default:
+			break;
+		}
+	}
+
+	/*
+	 * PDs don't take any HCA memory, but we assign them as part
+	 * of the HCA profile anyway.
+	 */
+	dev->limits.num_pds = MTHCA_NUM_PDS;
+
+	kfree(profile);
+	return 0;
+}
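[Editor's note: the log_num bookkeeping above leans on the counts being
powers of two, so that ffs(n) - 1 is exactly log2(n); the max(..., 0) clamp
handles a zero count, for which ffs() returns 0. A quick standalone check,
illustration only:]

	#include <assert.h>
	#include <strings.h>	/* ffs() */

	int main(void)
	{
		assert(ffs(1 << 16) - 1 == 16);	/* power of two: ffs(n) - 1 == log2(n) */
		assert(ffs(32) - 1 == 5);
		assert(ffs(0) == 0);		/* zero count clamps to log_num = 0 */
		return 0;
	}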
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h	2004-11-23 08:10:20.642518636 -0800
@@ -0,0 +1,58 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_profile.h 186 2004-05-24 02:23:08Z roland $
+ */
+
+#ifndef MTHCA_PROFILE_H
+#define MTHCA_PROFILE_H
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_RES_QP,
+	MTHCA_RES_EEC,
+	MTHCA_RES_SRQ,
+	MTHCA_RES_CQ,
+	MTHCA_RES_EQP,
+	MTHCA_RES_EEEC,
+	MTHCA_RES_EQ,
+	MTHCA_RES_RDB,
+	MTHCA_RES_MCG,
+	MTHCA_RES_MPT,
+	MTHCA_RES_MTT,
+	MTHCA_RES_UAR,
+	MTHCA_RES_UDAV,
+	MTHCA_RES_NUM
+};
+
+int mthca_make_profile(struct mthca_dev *mdev,
+		       struct mthca_dev_lim *dev_lim,
+		       struct mthca_init_hca_param *init_hca);
+
+#endif /* MTHCA_PROFILE_H */
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c	2004-11-23 08:10:20.724506547 -0800
@@ -0,0 +1,228 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_reset.c 950 2004-10-07 18:21:02Z roland $
+ */
+
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+int mthca_reset(struct mthca_dev *mdev)
+{
+	int i;
+	int err = 0;
+	u32 *hca_header    = NULL;
+	u32 *bridge_header = NULL;
+	struct pci_dev *bridge = NULL;
+
+#define MTHCA_RESET_OFFSET 0xf0010
+#define MTHCA_RESET_VALUE  cpu_to_be32(1)
+
+	/*
+	 * Reset the chip.  This is somewhat ugly because we have to
+	 * save off the PCI header before reset and then restore it
+	 * after the chip reboots.
We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:25 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:25 -0800 Subject: [openib-general] [PATCH][RFC/v2][12/21] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <20041123815.dUhm1PnERtccLLnp@topspin.com> Message-ID: <20041123815.KMR5AMwRXU875N9Z@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-11-23 08:10:20.997466300 -0800 @@ -0,0 +1,821 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ *
+ * $Id: mthca_cq.c 996 2004-10-14 05:47:49Z roland $
+ */
+
+#include <linux/init.h>
+
+#include <ib_pack.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE
+};
+
+enum {
+	MTHCA_CQ_ENTRY_SIZE = 0x20
+};
+
+struct mthca_cq_context {
+	u32 flags;
+	u64 start;
+	u32 logsize_usrpage;
+	u32 error_eqn;
+	u32 comp_eqn;
+	u32 pd;
+	u32 lkey;
+	u32 last_notified_index;
+	u32 solicit_producer_index;
+	u32 consumer_index;
+	u32 producer_index;
+	u32 cqn;
+	u32 reserved[3];
+} __attribute__((packed));
+
+#define MTHCA_CQ_STATUS_OK          ( 0 << 28)
+#define MTHCA_CQ_STATUS_OVERFLOW    ( 9 << 28)
+#define MTHCA_CQ_STATUS_WRITE_FAIL  (10 << 28)
+#define MTHCA_CQ_FLAG_TR            ( 1 << 18)
+#define MTHCA_CQ_FLAG_OI            ( 1 << 17)
+#define MTHCA_CQ_STATE_DISARMED     ( 0 <<  8)
+#define MTHCA_CQ_STATE_ARMED        ( 1 <<  8)
+#define MTHCA_CQ_STATE_ARMED_SOL    ( 4 <<  8)
+#define MTHCA_EQ_STATE_FIRED        (10 <<  8)
+
+enum {
+	MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe
+};
+
+enum {
+	SYNDROME_LOCAL_LENGTH_ERR        = 0x01,
+	SYNDROME_LOCAL_QP_OP_ERR         = 0x02,
+	SYNDROME_LOCAL_EEC_OP_ERR        = 0x03,
+	SYNDROME_LOCAL_PROT_ERR          = 0x04,
+	SYNDROME_WR_FLUSH_ERR            = 0x05,
+	SYNDROME_MW_BIND_ERR             = 0x06,
+	SYNDROME_BAD_RESP_ERR            = 0x10,
+	SYNDROME_LOCAL_ACCESS_ERR        = 0x11,
+	SYNDROME_REMOTE_INVAL_REQ_ERR    = 0x12,
+	SYNDROME_REMOTE_ACCESS_ERR       = 0x13,
+	SYNDROME_REMOTE_OP_ERR           = 0x14,
+	SYNDROME_RETRY_EXC_ERR           = 0x15,
+	SYNDROME_RNR_RETRY_EXC_ERR       = 0x16,
+	SYNDROME_LOCAL_RDD_VIOL_ERR      = 0x20,
+	SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21,
+	SYNDROME_REMOTE_ABORTED_ERR      = 0x22,
+	SYNDROME_INVAL_EECN_ERR          = 0x23,
+	SYNDROME_INVAL_EEC_STATE_ERR     = 0x24
+};
+
+struct mthca_cqe {
+	u32 my_qpn;
+	u32 my_ee;
+	u32 rqpn;
+	u16 sl_g_mlpath;
+	u16 rlid;
+	u32 imm_etype_pkey_eec;
+	u32 byte_cnt;
+	u32 wqe;
+	u8  opcode;
+	u8  is_send;
+	u8  reserved;
+	u8  owner;
+} __attribute__((packed));
+
+struct mthca_err_cqe {
+	u32 my_qpn;
+	u32 reserved1[3];
+	u8  syndrome;
+	u8  reserved2;
+	u16 db_cnt;
+	u32 reserved3;
+	u32 wqe;
+	u8  opcode;
+	u8  reserved4[2];
+	u8  owner;
+} __attribute__((packed));
+
+#define MTHCA_CQ_ENTRY_OWNER_SW      (0 << 7)
+#define MTHCA_CQ_ENTRY_OWNER_HW      (1 << 7)
+
+#define MTHCA_CQ_DB_INC_CI       (1 << 24)
+#define MTHCA_CQ_DB_REQ_NOT      (2 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_SOL  (3 << 24)
+#define MTHCA_CQ_DB_SET_CI       (4 << 24)
+#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24)
+
+static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry)
+{
+	if (cq->is_direct)
+		return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE);
+	else
+		return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf
+			+ (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE;
+}
+
+static inline int cqe_sw(struct mthca_cq *cq, int i)
+{
+	return !(MTHCA_CQ_ENTRY_OWNER_HW &
+		 get_cqe(cq, i)->owner);
+}
+
+static inline int next_cqe_sw(struct mthca_cq *cq)
+{
+	return cqe_sw(cq, cq->cons_index);
+}
+
+static inline void set_cqe_hw(struct mthca_cq *cq, int entry)
+{
+	get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW;
+}
+
+static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
+				  int nent)
+{
+	u32 doorbell[2];
+
+	doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn);
+	doorbell[1] = cpu_to_be32(nent - 1);
+
+	mthca_write64(doorbell,
+		      dev->kar + MTHCA_CQ_DOORBELL,
+		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+}
+
+void mthca_cq_event(struct mthca_dev *dev, u32 cqn)
+{
+	struct mthca_cq *cq;
+
+	spin_lock(&dev->cq_table.lock);
+	cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1));
+	if (cq)
+		atomic_inc(&cq->refcount);
+
spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs,
+			       (1 << 24) - 1,
+			       dev->limits.reserved_cqs);
+	if (err)
+		return err;
+
+	err = mthca_array_init(&dev->cq_table.cq,
+			       dev->limits.num_cqs);
+	if (err)
+		mthca_alloc_cleanup(&dev->cq_table.alloc);
+
+	return err;
+}
+
+void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev)
+{
+	mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs);
+	mthca_alloc_cleanup(&dev->cq_table.alloc);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c	2004-11-23 08:10:21.032461140 -0800
@@ -0,0 +1,1485 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.  These details are also available at
+ * <http://openib.org/license.html>.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ *
+ * $Id: mthca_qp.c 1270 2004-11-18 21:47:31Z roland $
+ */
+
+#include <linux/init.h>
+
+#include <ib_verbs.h>
+#include <ib_cache.h>
+#include <ib_pack.h>
+
+#include "mthca_dev.h"
+#include "mthca_cmd.h"
+
+enum {
+	MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE,
+	MTHCA_ACK_REQ_FREQ       = 10,
+	MTHCA_FLIGHT_LIMIT       = 9,
+	MTHCA_UD_HEADER_SIZE     = 72 /* largest UD header possible */
+};
+
+enum {
+	MTHCA_QP_STATE_RST      = 0,
+	MTHCA_QP_STATE_INIT     = 1,
+	MTHCA_QP_STATE_RTR      = 2,
+	MTHCA_QP_STATE_RTS      = 3,
+	MTHCA_QP_STATE_SQE      = 4,
+	MTHCA_QP_STATE_SQD      = 5,
+	MTHCA_QP_STATE_ERR      = 6,
+	MTHCA_QP_STATE_DRAINING = 7
+};
+
+enum {
+	MTHCA_QP_ST_RC  = 0x0,
+	MTHCA_QP_ST_UC  = 0x1,
+	MTHCA_QP_ST_RD  = 0x2,
+	MTHCA_QP_ST_UD  = 0x3,
+	MTHCA_QP_ST_MLX = 0x7
+};
+
+enum {
+	MTHCA_QP_PM_MIGRATED = 0x3,
+	MTHCA_QP_PM_ARMED    = 0x0,
+	MTHCA_QP_PM_REARM    = 0x1
+};
+
+enum {
+	/* qp_context flags */
+	MTHCA_QP_BIT_DE  = 1 <<  8,
+	/* params1 */
+	MTHCA_QP_BIT_SRE = 1 << 15,
+	MTHCA_QP_BIT_SWE = 1 << 14,
+	MTHCA_QP_BIT_SAE = 1 << 13,
+	MTHCA_QP_BIT_SIC = 1 <<  4,
+	MTHCA_QP_BIT_SSC = 1 <<  3,
+	/* params2 */
+	MTHCA_QP_BIT_RRE = 1 << 15,
+	MTHCA_QP_BIT_RWE = 1 << 14,
+	MTHCA_QP_BIT_RAE = 1 << 13,
+	MTHCA_QP_BIT_RIC = 1 <<  4,
+	MTHCA_QP_BIT_RSC = 1 <<  3
+};
+
+struct mthca_qp_path {
+	u32 port_pkey;
+	u8  rnr_retry;
+	u8  g_mylmc;
+	u16 rlid;
+	u8  ackto;
+	u8  mgid_index;
+	u8  static_rate;
+	u8  hop_limit;
+	u32 sl_tclass_flowlabel;
+	u8  rgid[16];
+} __attribute__((packed));
+
+struct mthca_qp_context {
+	u32 flags;
+	u32 sched_queue;
+	u32 mtu_msgmax;
+	u32 usr_page;
+	u32 local_qpn;
+	u32 remote_qpn;
+	u32 reserved1[2];
+	struct mthca_qp_path pri_path;
+	struct mthca_qp_path alt_path;
+	u32 rdd;
+	u32 pd;
+	u32 wqe_base;
+	u32 wqe_lkey;
+	u32 params1;
+	u32 reserved2;
+	u32 next_send_psn;
+	u32 cqn_snd;
+	u32 next_snd_wqe[2];
+	u32 last_acked_psn;
+	u32 ssn;
+	u32 params2;
+	u32 rnr_nextrecvpsn;
+	u32 ra_buff_indx;
+	u32 cqn_rcv;
+	u32 next_rcv_wqe[2];
+	u32 qkey;
+	u32 srqn;
+	u32 rmsn;
+	u32 reserved3[19];
+} __attribute__((packed));
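[Editor's note: these context structures are consumed by the HCA in
big-endian layout, which is why mthca_modify_qp() below converts every field
with cpu_to_be32()/cpu_to_be16() before issuing the firmware command. A
minimal sketch of the convention follows; the helper function is
hypothetical, while cpu_to_be32(), MTHCA_KAR_PAGE, and the struct fields are
from the patch. Illustration only:]

	/* Not part of the patch -- sketch of the byte-order convention. */
	static void sketch_fill_context(struct mthca_qp_context *ctx, u32 qpn)
	{
		memset(ctx, 0, sizeof *ctx);		/* reserved fields must stay zero */
		ctx->local_qpn = cpu_to_be32(qpn);	/* hardware reads fields big-endian */
		ctx->usr_page  = cpu_to_be32(MTHCA_KAR_PAGE);
	}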
+ +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + 
qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + 
IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + 
mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = 
cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. 
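+ *
+ * (Editorial note: the wqe_shift values computed below are log2
+ * WQE strides, so receive WQE i ends up at offset i << rq.wqe_shift
+ * and send WQE i at send_wqe_offset + (i << sq.wqe_shift).)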
+ * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + 
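/*
+ * Note (editorial): the reference count starts at 1 on behalf
+ * of the creator. The async event handler above takes a
+ * reference for the duration of each event dispatch, and
+ * mthca_free_qp() drops the initial reference and sleeps on
+ * qp->wait until the count reaches zero.
+ */
+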
atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + 
pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
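+ /*
+ * (Editorial note: a remote Q_Key with the high bit set means
+ * "use the QP's own Q_Key" -- the usual IB convention for the
+ * well-known Q_Key -- which is what this test implements.)
+ */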
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
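+ /*
+ * (Editorial note: size0 is still zero only while linking in
+ * the first WQE of a batch, so only that link gets the DBD
+ * bit; the batch as a whole is announced by the doorbell
+ * below, which carries size0 and op0.)
+ */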
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:38 -0800 Subject: [openib-general] [PATCH][RFC/v2][13/21] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <20041123815.KMR5AMwRXU875N9Z@topspin.com> Message-ID: <20041123815.NWFV7rNrbnpqbYAH@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-11-23 08:10:21.345414995 -0800 @@ -0,0 +1,212 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1180 2004-11-09 05:12:12Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +} __attribute__((packed)); + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if 
(!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-11-23 08:10:21.371411162 -0800 @@ -0,0 +1,372 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 639 2004-08-13 17:54:32Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +} __attribute__((packed)); + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
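+ *
+ * (Editorial sketch: MGM entries indexed by the MGID hash head
+ * the chains, overflow entries live in the AMGM, and chains are
+ * linked through next_gid_index, which stores the next entry's
+ * index shifted left by 5 bits.)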
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
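/*
+ * Outline (editorial): look up the group, clear this QP's slot,
+ * compact the remaining QPs in the entry, and if the group is
+ * now empty unlink it from its hash chain -- either by pulling
+ * the next AMGM entry into the chain head or by zeroing the
+ * head's MGID.
+ */
+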
struct mthca_dev *dev = to_mdev(ibqp->device);
+ void *mailbox;
+ struct mthca_mgm *mgm;
+ u16 hash;
+ int prev, index;
+ int i, loc;
+ int err;
+ u8 status;
+
+ mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL);
+ if (!mailbox)
+ return -ENOMEM;
+ mgm = MAILBOX_ALIGN(mailbox);
+
+ if (down_interruptible(&dev->mcg_table.sem))
+ return -EINTR;
+
+ err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index);
+ if (err)
+ goto out;
+
+ if (index == -1) {
+ mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x "
+ "not found\n",
+ be16_to_cpu(((u16 *) gid->raw)[0]),
+ be16_to_cpu(((u16 *) gid->raw)[1]),
+ be16_to_cpu(((u16 *) gid->raw)[2]),
+ be16_to_cpu(((u16 *) gid->raw)[3]),
+ be16_to_cpu(((u16 *) gid->raw)[4]),
+ be16_to_cpu(((u16 *) gid->raw)[5]),
+ be16_to_cpu(((u16 *) gid->raw)[6]),
+ be16_to_cpu(((u16 *) gid->raw)[7]));
+ err = -EINVAL;
+ goto out;
+ }
+
+ for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) {
+ if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31)))
+ loc = i;
+ if (!(mgm->qp[i] & cpu_to_be32(1 << 31)))
+ break;
+ }
+
+ if (loc == -1) {
+ mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num);
+ err = -EINVAL;
+ goto out;
+ }
+
+ mgm->qp[loc] = mgm->qp[i - 1];
+ mgm->qp[i - 1] = 0;
+
+ err = mthca_WRITE_MGM(dev, index, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (i != 1)
+ goto out;
+
+ if (prev == -1) {
+ /* Remove entry from MGM */
+ if (be32_to_cpu(mgm->next_gid_index) >> 5) {
+ err = mthca_READ_MGM(dev,
+ be32_to_cpu(mgm->next_gid_index) >> 5,
+ mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "READ_MGM returned status %02x\n",
+ status);
+ err = -EINVAL;
+ goto out;
+ }
+ } else
+ memset(mgm->gid, 0, 16);
+
+ err = mthca_WRITE_MGM(dev, index, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ /* Remove entry from AMGM */
+ index = be32_to_cpu(mgm->next_gid_index) >> 5;
+ err = mthca_READ_MGM(dev, prev, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "READ_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+
+ mgm->next_gid_index = cpu_to_be32(index << 5);
+
+ err = mthca_WRITE_MGM(dev, prev, mgm, &status);
+ if (err)
+ goto out;
+ if (status) {
+ mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ out:
+ up(&dev->mcg_table.sem);
+ kfree(mailbox);
+ return err;
+}
+
+int __devinit mthca_init_mcg_table(struct mthca_dev *dev)
+{
+ int err;
+
+ err = mthca_alloc_init(&dev->mcg_table.alloc,
+ dev->limits.num_amgms,
+ dev->limits.num_amgms - 1,
+ 0);
+ if (err)
+ return err;
+
+ init_MUTEX(&dev->mcg_table.sem);
+
+ return 0;
+}
+
+void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev)
+{
+ mthca_alloc_cleanup(&dev->mcg_table.alloc);
+}
+
+/*
+ * Local Variables:
+ * c-file-style: "linux"
+ * indent-tabs-mode: t
+ * End:
+ */
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-11-23 08:10:21.410405413 -0800
@@ -0,0 +1,389 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available at
+ * , or the OpenIB.org BSD
+ * license, available in the LICENSE.TXT file accompanying this
+ * software.
These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", 
err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + 
mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-11-23 08:10:21.436401580 -0800 @@ -0,0 +1,76 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_pd.c 1029 2004-10-20 23:16:28Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:46 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:46 -0800 Subject: [openib-general] [PATCH][RFC/v2][14/21] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <20041123815.NWFV7rNrbnpqbYAH@topspin.com> Message-ID: <20041123815.Irsm0l3oz7MStqls@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-11-23 08:10:21.738357057 -0800 @@ -0,0 +1,321 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1190 2004-11-10 17:12:44Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
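+ *
+ * (Hence sm_lock below only needs to cover posting the send,
+ * not the lifetime of the MAD on the wire.)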
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
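+ /* q == 1 registers the GSI agent (QP1), q == 0 the SMI agent (QP0): */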
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} + +/* + * Local Variables: + * c-file-style: "linux" + * indent-tabs-mode: t + * End: + */ From roland at topspin.com Tue Nov 23 08:15:52 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:52 -0800 Subject: [openib-general] [PATCH][RFC/v2][15/21] IPoIB IPv4 multicast In-Reply-To: <20041123815.Irsm0l3oz7MStqls@topspin.com> Message-ID: <20041123815.3UphmLcWp4RG6D85@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-11-23 08:10:22.004317841 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-11-23 08:09:44.620829918 -0800 +++ linux-bk/include/net/ip.h 2004-11-23 08:10:22.005317694 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. 
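+ *
+ * Layout of the resulting 20-byte hardware address: byte 0 is
+ * reserved, bytes 1-3 carry the all-ones multicast QPN, and
+ * bytes 4-19 hold the multicast GID (link-local scope, IPv4
+ * signature, P_Key, then the low 28 bits of the group address).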
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-11-23 08:09:54.024443395 -0800 +++ linux-bk/net/ipv4/arp.c 2004-11-23 08:10:22.005317694 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Tue Nov 23 08:15:57 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:15:57 -0800 Subject: [openib-general] [PATCH][RFC/v2][16/21] IPoIB IPv6 support In-Reply-To: <20041123815.3UphmLcWp4RG6D85@topspin.com> Message-ID: <20041123815.OuqXEOqXJtDtY180@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-11-23 08:09:55.180272973 -0800 +++ linux-bk/include/net/if_inet6.h 2004-11-23 08:10:22.300274203 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-11-23 08:09:54.776332532 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-11-23 08:10:22.302273908 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1098,6 +1099,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1797,6 +1804,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; --- linux-bk.orig/net/ipv6/ndisc.c 2004-11-23 08:09:38.159782567 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-11-23 08:10:22.302273908 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Tue Nov 23 08:16:03 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:03 -0800 Subject: [openib-general] [PATCH][RFC/v2][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041123815.OuqXEOqXJtDtY180@topspin.com> Message-ID: <20041123816.7BdwvFRYhI45pb9i@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-11-23 08:10:19.036755403 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:22.620227027 -0800 @@ -10,4 +10,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-11-23 08:10:18.998761005 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-11-23 08:10:22.583232481 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-11-23 08:10:22.719212431 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. 
The output can be turned on by setting + the debug_level parameter to 2; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-11-23 08:10:22.683217739 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-11-23 08:10:22.764205797 -0800 @@ -0,0 +1,314 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib.h 1275 2004-11-22 23:04:04Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +#include "ipoib_proto.h" + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_TX_FULL = 0, + IPOIB_FLAG_OPER_UP = 1, + IPOIB_FLAG_ADMIN_UP = 2, + IPOIB_PKEY_ASSIGNED = 3, + IPOIB_PKEY_STOP = 4, + IPOIB_FLAG_SUBINTERFACE = 5, + IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned 
int mcast_mtu; + + struct ipoib_buf *rx_ring; + + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct net_device *dev; + struct neighbour *neighbour; +}; + +static inline struct ipoib_path **to_ipoib_path(struct neighbour *neigh) +{ + return (struct ipoib_path **) (neigh->ha + 24); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void 
ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; +extern int mcast_debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (debug_level > 1) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-11-23 08:10:22.816198131 -0800 @@ -0,0 +1,276 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return 
__ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-11-23 08:10:22.857192086 -0800 @@ -0,0 +1,626 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_ib.c 1267 2004-11-18 20:31:22Z roland $ + */ + +#include + +#include + +#include "ipoib.h" + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, 
IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + if (!(skb = skb_unshare(skb, GFP_ATOMIC))) { + ipoib_warn(priv, "failed to unshare sk_buff. Dropping\n"); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). 
That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len)) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + } else { + unsigned long flags; + + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + spin_lock_irqsave(&priv->lock, flags); + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. 
+ */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. 
+ */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-11-23 08:10:22.898186042 -0800 @@ -0,0 +1,954 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_main.c 1273 2004-11-22 22:59:30Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define DATA_PATH_DEBUG_HELP " and data path tracing if > 1" +#else +#define DATA_PATH_DEBUG_HELP "" +#endif + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" DATA_PATH_DEBUG_HELP); + +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_device_handle(struct net_device *dev, struct ib_device **ca, + u8 *port_num, union ib_gid *gid, u16 *pkey) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + *ca = priv->ca; + *port_num = priv->port; + *gid = priv->local_gid; + *pkey = priv->pkey; + + return 0; +} +EXPORT_SYMBOL(ipoib_device_handle); + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct ipoib_dev_priv *priv = netdev_priv(path->dev); + struct sk_buff *skb; + 
struct ipoib_ah *ah; + + ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", + status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + + if (status != IB_WC_SUCCESS) + goto err; + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(path->dev, priv->pd, &av); + } + + if (!ah) + goto err; + + path->ah = ah; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, pathrec->dlid, pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) { + skb->dev = path->dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } + + return; + +err: + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb(skb); + + if (path->neighbour) + *to_ipoib_path(path->neighbour) = NULL; + + kfree(path); +} + +static int path_rec_start(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); + struct ib_sa_path_rec rec = { + .numb_path = 1 + }; + struct ib_sa_query *query; + + if (!path) + goto err; + + path->ah = NULL; + path->dev = dev; + skb_queue_head_init(&path->queue); + __skb_queue_tail(&path->queue, skb); + path->neighbour = NULL; + + rec.sgid = priv->local_gid; + memcpy(rec.dgid.raw, skb->dst->neighbour->ha + 4, 16); + rec.pkey = cpu_to_be16(priv->pkey); + + /* + * XXX there's a race here if path record completion runs + * before we get to finish up. Add a lock to path struct? + */ + if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &query) < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + goto err; + } + + path->neighbour = skb->dst->neighbour; + *to_ipoib_path(skb->dst->neighbour) = path; + return 0; + +err: + kfree(path); + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + return 0; +} + +static int path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) + return path_rec_start(skb, dev); + + /* Add in the P_Key */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4), + skb); + return 0; +} + +static void unicast_arp_completion(int status, + struct ib_sa_path_rec *pathrec, + void *skb_ptr) +{ + struct sk_buff *skb = skb_ptr; + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + struct ipoib_ah *ah; + + ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", + status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + + if (status) + goto err; + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(skb->dev, priv->pd, &av); + } + + if (!ah) + goto err; + + *(struct ipoib_ah **) skb->cb = ah; + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue ARP packet\n"); + + return; + +err: + dev_kfree_skb(skb); +} + +static void unicast_arp_finish(struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); 
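+ /* unicast_arp_completion() stashed the AH pointer in skb->cb */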
+ struct ipoib_ah *ah = *(struct ipoib_ah **) skb->cb; + unsigned long flags; + + if (ah) { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +/* + * For unicast packets with no skb->dst->neighbour (unicast ARPs are + * the main example), we fire off a path record query for each packet. + * This is pretty bad for scalability (since this is going to hammer + * the SM on a big fabric) but it's the best I can think of for now. + * + * Also we might have a problem if a path changes, because ARPs will + * still go through (since we'll get the new path from the SM for + * these queries) so we'll never update the neighbour. + */ +static int unicast_arp_start(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *tmp_skb; + struct ib_sa_path_rec rec = { + .numb_path = 1 + }; + struct ib_sa_query *query; + + if (skb->destructor) { + tmp_skb = skb; + skb = skb_clone(tmp_skb, GFP_ATOMIC); + dev_kfree_skb_any(tmp_skb); + if (!skb) { + ++priv->stats.tx_dropped; + return 0; + } + } + + skb->dev = dev; + skb->destructor = unicast_arp_finish; + memset(skb->cb, 0, sizeof skb->cb); + + rec.sgid = priv->local_gid; + memcpy(rec.dgid.raw, phdr->hwaddr + 4, 16); + rec.pkey = cpu_to_be16(priv->pkey); + + /* + * XXX We need to keep a record of the skb and TID somewhere + * so that we can cancel the request if the device goes down + * before it finishes. + */ + if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + unicast_arp_completion, + skb, &query) < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + return 0; +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) + return path_lookup(skb, dev); + + path = *to_ipoib_path(skb->dst->neighbour); + + if (likely(path->ah)) { + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + return 0; + } + + if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&path->queue, skb); + else + goto err; + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key */ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } + else { + /* unicast GID -- ARP reply?? */ + + /* + * If destructor is unicast_arp_finish, we've + * already been through the path lookup and + * now we can just send the packet. + */ + if (skb->destructor == unicast_arp_finish) { + ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, + be32_to_cpup((u32 *) phdr->hwaddr)); + return 0; + } + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? 
"neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + return 0; + } + + /* put the pseudoheader back on */ + skb_push(skb, sizeof *phdr); + return unicast_arp_start(skb, dev, phdr); + } + } + + return 0; + +err: + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + return 0; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. + */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *neigh) +{ + struct ipoib_path *path = *to_ipoib_path(neigh); + + ipoib_dbg(netdev_priv(neigh->dev), + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) neigh->ha), + IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); + + if (path && path->ah) { + ipoib_put_ah(path->ah); + kfree(path); + } +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. 
+ */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. 
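+ * (That is IPOIB_ENCAP_LEN = 4 bytes of encapsulation header
+ * plus the INFINIBAND_ALEN = 20 byte hardware address.)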
+ */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_wq: + destroy_workqueue(ipoib_workqueue); + +err_fs: + ipoib_unregister_debugfs(); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-11-23 08:10:22.940179850 -0800 @@ -0,0 +1,928 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * + * $Id: ipoib_multicast.c 1277 2004-11-23 01:08:07Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if 
(test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + 
mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if 
(!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, 
be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. + */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. 
+ */ + mcast = NULL; + } + +out: + spin_unlock_irqrestore(&priv->lock, flags); + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_path(skb->dst->neighbour)) { + struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); + + if (path) { + kref_get(&mcast->ah->ref); + path->ah = mcast->ah; + path->dev = dev; + path->neighbour = skb->dst->neighbour; + *to_ipoib_path(skb->dst->neighbour) = path; + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. 
We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark the entries that are found, and create the ones that don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send-only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries that don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union
ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_proto.h 2004-11-23 08:10:22.978174248 -0800 @@ -0,0 +1,37 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib_proto.h 1254 2004-11-17 17:19:12Z roland $ + */ + +#ifndef _IPOIB_PROTO_H +#define _IPOIB_PROTO_H + +#include +#include + +/* + * Public functions + */ + +int ipoib_device_handle(struct net_device *dev, struct ib_device **ca, + u8 *port_num, union ib_gid *gid, u16 *pkey); + +#endif /* _IPOIB_PROTO_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-11-23 08:10:23.018168351 -0800 @@ -0,0 +1,248 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_verbs.c 1262 2004-11-18 17:38:36Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in an INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq; + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_destroy_qp failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_destroy_cq failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE) {
ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} + +/* + Local Variables: + c-file-style: "linux" + indent-tabs-mode: t + End: +*/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-11-23 08:10:23.043164665 -0800 @@ -0,0 +1,166 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib_vlan.c 1271 2004-11-18 22:11:29Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. 
+ */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Tue Nov 23 08:16:09 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:09 -0800 Subject: [openib-general] [PATCH][RFC/v2][18/21] Add InfiniBand userspace MAD support In-Reply-To: <20041123816.7BdwvFRYhI45pb9i@topspin.com> Message-ID: <20041123816.bPLXoHbNS6amekEO@topspin.com> Add a driver that provides a character special device for each InfiniBand port. This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. 
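For illustration only, a minimal userspace client of this interface could be sketched as below. The sketch is not part of the patch: the /dev/umad0 path is an assumption (node creation depends on the local udev/devfs setup), error handling is reduced to bare exits, and it uses only the ioctls and structures from the ib_user_mad.h header added later in this message.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include "ib_user_mad.h"	/* copied from this patch series */

int main(void)
{
	struct ib_user_mad_reg_req req;
	struct ib_user_mad mad;
	__u32 abi;
	int fd;

	fd = open("/dev/umad0", O_RDWR);	/* hypothetical device path */
	if (fd < 0)
		return 1;

	/* Check that kernel and userspace agree on the ABI version */
	if (ioctl(fd, IB_USER_MAD_GET_ABI_VERSION, &abi) < 0 ||
	    abi != IB_USER_MAD_ABI_VERSION)
		return 1;

	/*
	 * Register an agent on the GSI QP (qpn must be 0 or 1).
	 * mgmt_class stays 0 since we don't want unsolicited MADs;
	 * the kernel fills in req.id on success.
	 */
	memset(&req, 0, sizeof req);
	req.qpn = 1;
	if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, &req) < 0)
		return 1;

	/*
	 * read() blocks until a MAD for one of our agents arrives,
	 * e.g. the response to a MAD we sent earlier with write().
	 */
	if (read(fd, &mad, sizeof mad) == sizeof mad)
		printf("agent %u: MAD from LID 0x%04x, status %u\n",
		       (unsigned) mad.id, ntohs(mad.lid),
		       (unsigned) mad.status);

	ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &req.id);
	close(fd);
	return 0;
}

Sends work the same way in reverse: fill in a struct ib_user_mad (agent id, destination LID/QPN/Q_Key and the MAD payload in data[]) and write() it to the same file descriptor.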
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-11-23 08:10:18.652812015 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-11-23 08:10:23.631077978 -0800 @@ -1,22 +1,12 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += \ - ib_core.o \ - ib_mad.o \ - ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o -ib_core-objs := \ - packer.o \ - ud_header.o \ - verbs.o \ - sysfs.o \ - device.o \ - fmr_pool.o \ - cache.o +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o -ib_mad-objs := \ - mad.o \ - smi.o \ - agent.o +ib_mad-y := mad.o smi.o agent.o -ib_sa-objs := sa_query.o +ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-11-23 08:10:23.697068248 -0800 @@ -0,0 +1,649 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device *class_dev; + struct ib_device *ib_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static struct class_simple *umad_class; + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct 
ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +{ + 
struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_GET_ABI_VERSION: + return put_user(IB_USER_MAD_ABI_VERSION, + (u32 __user *) arg) ? 
-EFAULT : 0; + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = class_get_devdata(class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + memset(&umad_dev->port[i - s].dev, 0, sizeof (struct cdev)); + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev = + class_simple_device_add(umad_class, + umad_dev->port[i - s].dev.dev, + device->dma_device, + "umad%d", umad_dev->port[i - s].devnum); + if (IS_ERR(umad_dev->port[i - s].class_dev)) + goto err_class; + + class_set_devdata(umad_dev->port[i - s].class_dev, + &umad_dev->port[i - s]); + + if (class_device_create_file(umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(umad_dev->port[i - 
s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) { + class_simple_device_remove(umad_dev->port[i - s].dev.dev); + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + } + + kfree(umad_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { + class_simple_device_remove(umad_dev->port[i].dev.dev); + cdev_del(&umad_dev->port[i].dev); + clear_bit(umad_dev->port[i].devnum, dev_map); + } + + kfree(umad_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + umad_class = class_simple_create(THIS_MODULE, "infiniband_mad"); + if (IS_ERR(umad_class)) { + printk(KERN_ERR "user_mad: couldn't create class_simple\n"); + ret = PTR_ERR(umad_class); + goto out_chrdev; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + return 0; + +out_class: + class_simple_destroy(umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + ib_unregister_client(&umad_client); + class_simple_destroy(umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-11-23 08:10:23.724064267 -0800 @@ -0,0 +1,111 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * <http://openib.org/license.html>. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include <linux/types.h> +#include <linux/ioctl.h> + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be received + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; }; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_GET_ABI_VERSION _IOR(IB_IOCTL_MAGIC, 0, __u32) + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) + +#endif /* IB_USER_MAD_H */ From roland at topspin.com Tue Nov 23 08:16:15 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:15 -0800 Subject: [openib-general] [PATCH][RFC/v2][19/21] Document InfiniBand ioctl use In-Reply-To: <20041123816.bPLXoHbNS6amekEO@topspin.com> Message-ID: <20041123816.baaAyOggjbry3R4e@topspin.com> Add the 0x1b ioctl magic number used by the ib_umad module to Documentation/ioctl-number.txt.
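For anyone wiring this up from userspace: the request codes in ib_user_mad.h are built from this magic number, so an ABI check ends up looking roughly like the sketch below (illustration only; the device path assumes the udev rule suggested in the documentation patch, and error handling is omitted):

	/* sketch, not part of this patch */
	int fd = open("/dev/infiniband/mthca0/ports/1/mad", O_RDWR);
	__u32 abi;

	if (fd >= 0 && ioctl(fd, IB_USER_MAD_GET_ABI_VERSION, &abi) == 0)
		printf("umad ABI version %u\n", abi);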
Signed-off-by: Roland Dreier --- linux-bk.orig/Documentation/ioctl-number.txt 2004-11-23 08:09:54.932309534 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-11-23 08:10:24.016021218 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Tue Nov 23 08:16:20 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:20 -0800 Subject: [openib-general] [PATCH][RFC/v2][20/21] Add InfiniBand Documentation files In-Reply-To: <20041123816.baaAyOggjbry3R4e@topspin.com> Message-ID: <20041123816.Z3lNI0kVfxRLOphJ@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-11-23 08:10:24.271983477 -0800 @@ -0,0 +1,55 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<intf name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output + in the data path when debug_level is set to 2. However, even with + the output disabled, this option will affect performance.
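+
+  For example, assuming the parameter files appear directly under
+  /sys/module/ib_ipoib/ as described above, tracing can be enabled
+  at runtime with:
+
+    echo 1 > /sys/module/ib_ipoib/debug_level
+    echo 1 > /sys/module/ib_ipoib/mcast_debug_level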
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-11-23 08:10:24.316976843 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-11-23 08:10:24.365969619 -0800 @@ -0,0 +1,77 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. 
For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%s{ibdev}/ports/%s{port}/mad" + + can be used. This will create a device node named + + /dev/infiniband/mthca0/ports/1/mad + + for port 1 of device mthca0, and so on. From roland at topspin.com Tue Nov 23 08:16:27 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 08:16:27 -0800 Subject: [openib-general] [PATCH][RFC/v2][21/21] InfiniBand MAINTAINERS entry In-Reply-To: <20041123816.Z3lNI0kVfxRLOphJ@topspin.com> Message-ID: <20041123816.kKEP5asEjoRbLoxS@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-11-23 08:09:38.208775343 -0800 +++ linux-bk/MAINTAINERS 2004-11-23 08:10:24.658926423 -0800 @@ -1075,6 +1075,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From robert.j.woodruff at intel.com Tue Nov 23 08:25:59 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 23 Nov 2004 08:25:59 -0800 Subject: [openib-general] troubles with IPoIB Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002DCBAE3@orsmsx408> >What is the firmware version of the PCIe adapters ? I have seen problems >like this when not all the adapters were at 4.5.3. I also saw a problem with multicast packets on PCI-E adapters, with the SF ipoib. There was a problem with the 4.5.3 firmware that seemed to be fixed with the 4.6.0-rc4 firmware. Not sure if that one is released yet, but you might want to check with Mellanox. From iod00d at hp.com Tue Nov 23 08:43:59 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 08:43:59 -0800 Subject: [openib-general] troubles with IPoIB In-Reply-To: <1101186939.29554.92.camel@trinity> References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> <1101186939.29554.92.camel@trinity> Message-ID: <20041123164359.GB10431@esmail.cup.hp.com> On Mon, Nov 22, 2004 at 09:15:38PM -0800, Matt Leininger wrote: > We are using fw_ver 4.5.0. Looks like we need to upgrade. Time to > try the user space firmware burning tools. FWIW, tvflash works fine under 2.6.10-rc1 kernels on ia64.
I hacked the code a bit so it's more informative about what's going on...I guess I should submit a diff back to Roland. thanks, grant From greg at kroah.com Tue Nov 23 09:22:56 2004 From: greg at kroah.com (Greg KH) Date: Tue, 23 Nov 2004 09:22:56 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Message-ID: <20041123172256.GA30264@kroah.com> On Tue, Nov 23, 2004 at 08:14:19AM -0800, Roland Dreier wrote: > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/core/cache.c 2004-11-23 08:10:16.816082837 -0800 > @@ -0,0 +1,338 @@ > +/* > + This software is available to you under a choice of one of two > + licenses. You may choose to be licensed under the terms of the GNU > + General Public License (GPL) Version 2, available at > + <http://www.fsf.org/copyleft/gpl.html>, or the OpenIB.org BSD > + license, available in the LICENSE.TXT file accompanying this > + software. These details are also available at > + <http://openib.org/license.html>. Sorry, but this is the wrong license for this file still. Come on, you can't tell me that your lawyers didn't vet this code at least once before submission... Looks like the openib group is going to have to give up on their dream of keeping a BSD license for their code, sorry. > +/* > + Local Variables: > + c-file-style: "linux" > + indent-tabs-mode: t > + End: > +*/ Are these really necessary in every file? Just set these to be your editor's defaults. thanks, greg k-h From roland at topspin.com Tue Nov 23 09:34:53 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 09:34:53 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123172256.GA30264@kroah.com> (Greg KH's message of "Tue, 23 Nov 2004 09:22:56 -0800") References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> <20041123172256.GA30264@kroah.com> Message-ID: <521xek79ky.fsf@topspin.com> Greg> Are these really necessary in every file? Just set these to Greg> be your editor's defaults. I'll strip them out before next time... - R. From halr at voltaire.com Tue Nov 23 09:58:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 23 Nov 2004 12:58:17 -0500 Subject: [openib-general] troubles with IPoIB In-Reply-To: <52llcs7ey0.fsf@topspin.com> References: <1101173164.18604.53.camel@localhost> <1101183978.4124.548.camel@localhost.localdomain> <1101186939.29554.92.camel@trinity> <52llcs7ey0.fsf@topspin.com> Message-ID: <1101232697.19855.2.camel@localhost.localdomain> On Tue, 2004-11-23 at 10:39, Roland Dreier wrote: > Matt> We are using fw_ver 4.5.0. Looks like we need to upgrade. > Matt> Time to try the user space firmware burning tools. > > I would recommend _not_ using tvflash to upgrade PCIe HCAs from FW > 4.5.0 to 4.5.3 right now. The invariant sector of flash needs to be > rewritten, and the version of tvflash checked in right now doesn't > handle that properly yet. Give me a day or so to fix it... The HCAs can still be updated using the Mellanox tools (new mstflint ? or other OS/driver with InfiniBurn and/or old mstflint) in the meantime or one can wait.
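(From memory, the mstflint invocation is something along the lines of "mstflint -d <PCI device> -i <image>.bin burn" -- but check the tool's own usage output, I may be misremembering the exact switches.)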
-- Hal From halr at voltaire.com Tue Nov 23 10:20:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 23 Nov 2004 13:20:21 -0500 Subject: [openib-general] Start of an IPoIB FAQ Message-ID: <1101234021.19855.16.camel@localhost.localdomain> Hi, The start of an IPoIB FAQ may be in order. Something along the lines of: ping doesn't work between IPoIB nodes. What should I do ? First, verify that the ports are active. This can be done via: cat /sys/class/infiniband/mthca0/ports/1/state This should indicate 4: ACTIVE assuming the HCA is mthca0 and port 1 is plugged in. Next, verify the firmware version via cat /sys/class/infiniband/mthca0/fw_ver For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version 4.5.3 is recommended. If these versions of the firmware are being used, indicate the configuration and which SM is being utilized. Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" show anything on the other nodes when you try to ping or something? There are 2 levels of IPoIB debug which can be enabled when building: IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. The latter has performance implications and should only be enabled when all else fails. Enable the first level of IPoIB debug and then: mount -t ipoib_debugfs none /ipoib_debugfs/ cat /ipoib_debugfs/ib0_mcg -- Hal From roland at topspin.com Tue Nov 23 10:40:50 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 10:40:50 -0800 Subject: [openib-general] [PATCH] Convert from pci_xxx to dma_xxx functions In-Reply-To: <52wtwd8cji.fsf@topspin.com> (Roland Dreier's message of "Mon, 22 Nov 2004 19:33:21 -0800") References: <52wtwd8cji.fsf@topspin.com> Message-ID: <52y8gs4de5.fsf@topspin.com> I heard no objections, so I'm going to go ahead and commit this. - R. From iod00d at hp.com Tue Nov 23 11:32:25 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 11:32:25 -0800 Subject: [openib-general] [PATCH] Convert from pci_xxx to dma_xxx functions In-Reply-To: <52wtwd8cji.fsf@topspin.com> References: <52wtwd8cji.fsf@topspin.com> Message-ID: <20041123193225.GC10431@esmail.cup.hp.com> On Mon, Nov 22, 2004 at 07:33:21PM -0800, Roland Dreier wrote: > Christoph Hellwig suggested we might as well put a generic struct > device *dma_device and use the generic dma_map functions rather than > assuming we're dealing with a PCI device. (There's no dma_xxx > equivalent of pci_unmap_addr_set() and friends, so I left that stuff-- > Christoph agrees this is OK for now). > > Look OK to commit? yeah - looks fine to me. thanks, grant From sam at ravnborg.org Tue Nov 23 11:53:45 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Tue, 23 Nov 2004 20:53:45 +0100 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041123814.rXLIXw020elfd6Da@topspin.com> References: <20041123814.p0AnYzTlx42JeVes@topspin.com> <20041123814.rXLIXw020elfd6Da@topspin.com> Message-ID: <20041123195345.GC8367@mars.ravnborg.org> On Tue, Nov 23, 2004 at 08:14:14AM -0800, Roland Dreier wrote: > Add public headers for core InfiniBand support. This can be thought > of as a midlayer that provides an abstraction between low-level > hardware drivers and upper level protocols (such as > IP-over-InfiniBand). > > Signed-off-by: Roland Dreier After giving it a second thought my vote goes for: include/linux/infiniband And just a few comments to the API towards drivers...
Sam From sam at ravnborg.org Tue Nov 23 11:56:54 2004 From: sam at ravnborg.org (Sam Ravnborg) Date: Tue, 23 Nov 2004 20:56:54 +0100 Subject: [openib-general] Re: [PATCH][RFC/v2][4/21] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <20041123814.xOcI2C4YpT1G9jQi@topspin.com> References: <20041123814.LeHMD5hRZLn6VbLm@topspin.com> <20041123814.xOcI2C4YpT1G9jQi@topspin.com> Message-ID: <20041123195654.GD8367@mars.ravnborg.org> On Tue, Nov 23, 2004 at 08:14:31AM -0800, Roland Dreier wrote: > + > +struct ib_grh { > + u32 version_tclass_flow; > + u16 paylen; > + u8 next_hdr; > + u8 hop_limit; > + union ib_gid sgid; > + union ib_gid dgid; > +} __attribute__ ((packed)); It was explained on lkml why these structs were packed. The same info belongs here as a comment so it is known next time. And I see comments on the API here - good. Sam From iod00d at hp.com Tue Nov 23 12:28:38 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 12:28:38 -0800 Subject: [openib-general] Start of an IPoIB FAQ In-Reply-To: <1101234021.19855.16.camel@localhost.localdomain> References: <1101234021.19855.16.camel@localhost.localdomain> Message-ID: <20041123202838.GI10431@esmail.cup.hp.com> On Tue, Nov 23, 2004 at 01:20:21PM -0500, Hal Rosenstock wrote: > Hi, > > The start of an IPoIB FAQ may be in order. Yes - good idea. I need it too. > Something along the lines of: > > ping doesn't work between IPoIB nodes. What should I do ? > > First, verify that the ports are active. This can be done via: > > cat /sys/class/infiniband/mthca0/ports/1/state cat: /sys/class/infiniband/mthca0/ports/1/state: No such file or directory gsyprf3:~# ls /sys/class/infiniband/ gsyprf3:~# lsmod Module Size Used by ib_ipoib 104344 0 ib_sa 24620 1 ib_ipoib ipt_state 5528 13 ib_mthca 168167 0 ib_mad 60352 2 ib_sa,ib_mthca ib_core 81328 4 ib_ipoib,ib_sa,ib_mthca,ib_mad gsyprf3:~# lspci -vs 81:0.0 0000:81:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66MHz, medium devsel, IRQ 67 Memory at 00000000cf700000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000cf800000 (64-bit, prefetchable) [size=8M] Memory at 00000000d0000000 (64-bit, prefetchable) [size=256M] Capabilities: [40] #11 [001f] Capabilities: [50] Vital Product Data Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device. Maybe trying to use an unsupported card? Ah yes...something along that line: ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0) GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 ib_mthca 0000:81:00.0: Unhandled event 0f(00) on eqn 3 ib_query_gid failed (-16) for mthca0 (index 12) ib_query_port failed (-16) for mthca0 ib_mthca 0000:81:00.0: WRITE_MTT failed (-16) ib_mad: Couldn't create ib_mad CQ ib_mad: Couldn't open mthca0 port 1 I can debug this a bit more later today. In any case, the FAQ sounds like a great idea.
grant From roland at topspin.com Tue Nov 23 13:20:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 13:20:38 -0800 Subject: [openib-general] Start of an IPoIB FAQ In-Reply-To: <20041123202838.GI10431@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 23 Nov 2004 12:28:38 -0800") References: <1101234021.19855.16.camel@localhost.localdomain> <20041123202838.GI10431@esmail.cup.hp.com> Message-ID: <52zn182rfd.fsf@topspin.com> > GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 > ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 > ib_mthca 0000:81:00.0: Unhandled event 0f(00) on eqn 3 > ib_query_gid failed (-16) for mthca0 (index 12) > ib_query_port failed (-16) for mthca0 > ib_mthca 0000:81:00.0: WRITE_MTT failed (-16) > ib_mad: Couldn't create ib_mad CQ > ib_mad: Couldn't open mthca0 port 1 Something very strange happened here. It looks like the event queue for firmware command completions overflowed, and then a couple of firmware commands timed out. What kind of system is this? What HCA firmware are you running? - R. From iod00d at hp.com Tue Nov 23 13:34:32 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 13:34:32 -0800 Subject: [openib-general] Start of an IPoIB FAQ In-Reply-To: <52zn182rfd.fsf@topspin.com> References: <1101234021.19855.16.camel@localhost.localdomain> <20041123202838.GI10431@esmail.cup.hp.com> <52zn182rfd.fsf@topspin.com> Message-ID: <20041123213432.GN10431@esmail.cup.hp.com> On Tue, Nov 23, 2004 at 01:20:38PM -0800, Roland Dreier wrote: > > GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 > > ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 > > ib_mthca 0000:81:00.0: Unhandled event 0f(00) on eqn 3 > > ib_query_gid failed (-16) for mthca0 (index 12) > > ib_query_port failed (-16) for mthca0 > > ib_mthca 0000:81:00.0: WRITE_MTT failed (-16) > > ib_mad: Couldn't create ib_mad CQ > > ib_mad: Couldn't open mthca0 port 1 > > Something very strange happened here. It looks like the event queue > for firmware command completions overflowed, and then a couple of > firmware commands timed out. > > What kind of system is this? ia64 rx2600. > What HCA firmware are you running? Erm...one of the firmware versions that I downloaded with tvflash. Non trivial to say since /sys/class/infiniband isn't available. Either fw3.3 or hca-cougar-a1-250-157.bin. Because tvflash can't identify it, I'll guess it's the fw3.3 (a version of HP's firmware that exposes the 3rd MMIO BAR). I can try again on another machine that has the hca-cougar firmware. grant From iod00d at hp.com Tue Nov 23 14:56:24 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Nov 2004 14:56:24 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... Message-ID: <20041123225624.GO10431@esmail.cup.hp.com> So the adventure continues on a different box (rx4640). (I'll go back to the rx2600 and reflash/reboot the box). With tvflash, I was able to upload the hca-cougar image I mentioned before successfully...at least that's what tvflash asserted. If you want me to try a different firmware, we should do that off-list. 
Running 2.6.10-rc2 kernel ended up with the following output: iowa:~# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:41:00.0) GSI 38 (level, low) -> CPU 1 (0x0100) vector 66 ACPI: PCI interrupt 0000:41:00.0[A] -> GSI 38 (level, low) -> IRQ 66 ib_mthca 0000:41:00.0: SYS_EN DDR error: syn=0, sock=0, sladdr=0, SPD source=DIMM ib_mthca 0000:41:00.0: SYS_EN returned status 0x07, aborting. ib_mthca: probe of 0000:41:00.0 failed with error -22 iowa:~# tvflash -i open_hca(0) flash_chip_reset() flash_check_failsafe() Error. String Tag not present (found tag 43 instead) HCA #0: Found MT23108, Cougar, revision A1 Primary image is valid, unknown source (sig 0x0/0x0) Secondary image is valid, unknown source (sig 0x0/0x0) Error. String Tag not present (found tag 43 instead) close_hca() Vital Product Dataiowa:~# tvflash isn't able to ID the new downloaded firmware. Seems like a bug but I don't have specs to see what it is. iowa:~# lspci -vs 41:0.0 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66MHz, medium devsel, IRQ 66 Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] Memory at (64-bit, prefetchable) Capabilities: [40] #11 [001f] Capabilities: [50] Vital Product Data Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device. Oh...I think I see the problem. System Firmware is having problems with this card. I need to update firmware on this box anyway and will report back. thanks, grant From roland at topspin.com Tue Nov 23 15:31:10 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 15:31:10 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123172256.GA30264@kroah.com> (Greg KH's message of "Tue, 23 Nov 2004 09:22:56 -0800") References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> <20041123172256.GA30264@kroah.com> Message-ID: <52mzx82ldt.fsf@topspin.com> I've just checked in a change that converts this file from using RCU to protecting its structures with an rwlock_t. This should avoid any patent licensing issues. These functions are extremely unlikely to have SMP scalability issues so this isn't too painful. Thanks, Roland From bunk at stusta.de Tue Nov 23 16:13:28 2004 From: bunk at stusta.de (Adrian Bunk) Date: Wed, 24 Nov 2004 01:13:28 +0100 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> Message-ID: <20041124001328.GE2927@stusta.de> On Tue, Nov 23, 2004 at 08:14:19AM -0800, Roland Dreier wrote: > Add implementation of core InfiniBand support. This can be thought of > as a midlayer that provides an abstraction between low-level hardware > drivers and upper level protocols (such as IP-over-InfiniBand). > > Signed-off-by: Roland Dreier > > > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-bk/drivers/infiniband/Kconfig 2004-11-23 08:10:16.399144313 -$ > @@ -0,0 +1,11 @@ > +menu "InfiniBand support" > + > +config INFINIBAND > + tristate "InfiniBand support" > + default n >... This "default n" has no effect. cu Adrian -- "Is there not promise of rain?" 
Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed From roland at topspin.com Tue Nov 23 16:24:24 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 23 Nov 2004 16:24:24 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][2/21] Add core InfiniBand support In-Reply-To: <20041124001328.GE2927@stusta.de> (Adrian Bunk's message of "Wed, 24 Nov 2004 01:13:28 +0100") References: <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123814.m1N7Tf2QmSCq9s5q@topspin.com> <20041124001328.GE2927@stusta.de> Message-ID: <52is7w2ix3.fsf@topspin.com> Adrian> This "default n" has no effect. Thanks, I've deleted it from our tree. - Roland From mst at mellanox.co.il Wed Nov 24 11:16:24 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Nov 2004 21:16:24 +0200 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041123225624.GO10431@esmail.cup.hp.com> References: <20041123225624.GO10431@esmail.cup.hp.com> Message-ID: <20041124191624.GA9404@mellanox.co.il> Hello! Quoting r. Grant Grundler (iod00d at hp.com) "[openib-general] HP ZX1 and HP IB cards...": > So the adventure continues on a different box (rx4640). > (I'll go back to the rx2600 and reflash/reboot the box). > > With tvflash, I was able to upload the hca-cougar image I mentioned > before successfully...at least that's what tvflash asserted. > > If you want me to try a different firmware, we should do that off-list. > > Running 2.6.10-rc2 kernel ended up with the following output: > iowa:~# modprobe ib_mthca > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:41:00.0) > GSI 38 (level, low) -> CPU 1 (0x0100) vector 66 > ACPI: PCI interrupt 0000:41:00.0[A] -> GSI 38 (level, low) -> IRQ 66 > ib_mthca 0000:41:00.0: SYS_EN DDR error: syn=0, sock=0, sladdr=0, SPD source=DIMM > ib_mthca 0000:41:00.0: SYS_EN returned status 0x07, aborting. > ib_mthca: probe of 0000:41:00.0 failed with error -22 > iowa:~# tvflash -i > open_hca(0) > flash_chip_reset() > flash_check_failsafe() > > Error. String Tag not present (found tag 43 instead) > HCA #0: Found MT23108, Cougar, revision A1 > Primary image is valid, unknown source (sig 0x0/0x0) > Secondary image is valid, unknown source (sig 0x0/0x0) > > > Error. String Tag not present (found tag 43 instead) > close_hca() > Vital Product Dataiowa:~# > > tvflash isn't able to ID the new downloaded firmware. > Seems like a bug but I don't have specs to see what it is. > > iowa:~# lspci -vs 41:0.0 > 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) > Subsystem: Mellanox Technology MT23108 InfiniHost > Flags: 66MHz, medium devsel, IRQ 66 > Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] > Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] > Memory at (64-bit, prefetchable) > Capabilities: [40] #11 [001f] > Capabilities: [50] Vital Product Data > Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- > Capabilities: [70] PCI-X non-bridge device. > > Oh...I think I see the problem. > System Firmware is having problems with this card. > I need to update firmware on this box anyway and will report back. > > thanks, > grant If you think it's a flash issue, try flashing with flint (mstflint under the openib tree works without kernel modules). This is what we always use at Mellanox for Cougars.
MST From roland at topspin.com Wed Nov 24 11:39:27 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 24 Nov 2004 11:39:27 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041123195345.GC8367@mars.ravnborg.org> (Sam Ravnborg's message of "Tue, 23 Nov 2004 20:53:45 +0100") References: <20041123814.p0AnYzTlx42JeVes@topspin.com> <20041123814.rXLIXw020elfd6Da@topspin.com> <20041123195345.GC8367@mars.ravnborg.org> Message-ID: <52brdnyr2o.fsf@topspin.com> Sam> After giving it a second thought my vote goes for: Sam> include/linux/infiniband Could you share the reasoning that led to that preference? Unfortunately we don't seem to be converging on one choice of location. On one side there is the fact that the .h files are not used outside of drivers/infiniband -- hence they should stay under drivers/infiniband. On the other side is the fact that moving the includes under include/ gets rid of some CFLAGS lines in the Makefile. I don't see a conclusive reason to choose any particular place. Perhaps Linus or Andrew can simply hand down an authoritative answer? Thanks, Roland From iod00d at hp.com Wed Nov 24 14:35:52 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 24 Nov 2004 14:35:52 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041124191624.GA9404@mellanox.co.il> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> Message-ID: <20041124223552.GA15993@esmail.cup.hp.com> On Wed, Nov 24, 2004 at 09:16:24PM +0200, Michael S. Tsirkin wrote: > > 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) > > Subsystem: Mellanox Technology MT23108 InfiniHost > > Flags: 66MHz, medium devsel, IRQ 66 > > Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] > > Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] > > Memory at (64-bit, prefetchable) is the problem. > > System Firmware is having problems with this card. yeah - turned out to be a firmware bug...not clear the firmware team will fix it but they are at least aware of it. We are also considering adding 64-bit MMIO (aka GMMIO) support to ia64-linux. We just learned that some boxes don't assign GMMIO. > If you think it's a flash issue, try flashing with flint > (mstflint under the openib tree works without kernel modules). This is what we > always use at Mellanox for Cougars. I'm certain this is not a flash issue. There might be issues with flash but not this one. thanks though, grant From mshefty at ichips.intel.com Wed Nov 24 16:59:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 24 Nov 2004 16:59:54 -0800 Subject: [openib-general] [PATCH] cleanup/fixes for handle_outgoing_smp Message-ID: <41A52E8A.3000802@ichips.intel.com> This patch restructures handle_outgoing_smp to improve its readability and fixes the following issues: removes unneeded memory allocation for received SMP, properly sends an SMP if the underlying HCA driver does not provide a process_mad routine, and deallocates the allocated received SMP in all failure cases.
- Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1291) +++ core/mad.c (working copy) @@ -366,108 +366,92 @@ struct ib_send_wr *send_wr) { int ret; + struct ib_mad_private *mad_priv; + struct ib_mad_send_wc mad_send_wc; if (!smi_handle_dr_smp_send(smp, mad_agent->device->node_type, mad_agent->port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); - goto error1; + goto out; } - if (smi_check_local_dr_smp(smp, - mad_agent->device, - mad_agent->port_num)) { - struct ib_mad_private *mad_priv; - struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_send_wc mad_send_wc; - - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - ret = -ENOMEM; - printk(KERN_ERR PFX "No memory for local " - "response MAD\n"); - goto error1; - } + /* Check to post send on QP or process locally. */ + ret = smi_check_local_dr_smp(smp, mad_agent->device, + mad_agent->port_num); + if (!ret || !mad_agent->device->process_mad) + goto out; - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - - if (mad_agent->device->process_mad) { - ret = mad_agent->device->process_mad( - mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto out; + } + ret = mad_agent->device->process_mad(mad_agent->device, 0, + mad_agent->port_num, smp->dr_slid, (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (ret & IB_MAD_RESULT_CONSUMED) { - ret = 1; - goto error1; - } - if (ret & IB_MAD_RESULT_REPLY) { - /* - * See if response is solicited and - * there is a recv handler - */ - if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; - - /* - * Defined behavior is to - * complete response before - * request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = - sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = - &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler( - mad_agent, - &mad_priv->header.recv_wc); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } - - if (mad_agent_priv->agent.send_handler) { - /* Now, complete send */ - mad_send_wc.status = IB_WC_SUCCESS; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = send_wr->wr_id; - mad_agent_priv->agent.send_handler( - mad_agent, - &mad_send_wc); - ret = 1; + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent->recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent->recv_handler(mad_agent, + &mad_priv->header.recv_wc); } else - ret = -EINVAL; - } else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = -EINVAL; + goto out; + } -error1: + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent->send_handler(mad_agent, &mad_send_wc); + ret = 1; +out: return ret; } From eli at mellanox.co.il Thu Nov 25 04:36:30 2004 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 25 Nov 2004 14:36:30 +0200 Subject: [openib-general] [PATCH] cache.c fixes Message-ID: <506C3D7B14CDD411A52C00025558DED605B1A4F9@mtlex01.yok.mtl.com> Looks like allocation size is buggy and also cleanup. 
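To spell out the allocation bug: without parentheses,

	kmalloc(sizeof *device->cache.pkey_cache *
		end_port(device) - start_port(device), GFP_KERNEL);

parses as (sizeof(element) * end_port(device)) - start_port(device), since multiplication binds tighter than subtraction, and the inclusive port range also needs the + 1. Hence the parenthesized (end_port(device) - start_port(device) + 1) below.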
Index: cache.c =================================================================== --- cache.c (revision 1292) +++ cache.c (working copy) @@ -249,10 +249,10 @@ device->cache.pkey_cache = kmalloc(sizeof *device->cache.pkey_cache * - end_port(device) - start_port(device), GFP_KERNEL); + (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.gid_cache = kmalloc(sizeof *device->cache.pkey_cache * - end_port(device) - start_port(device), GFP_KERNEL); + (end_port(device) - start_port(device) + 1), GFP_KERNEL); if (!device->cache.pkey_cache || !device->cache.gid_cache) { printk(KERN_WARNING "Couldn't allocate cache " @@ -280,8 +280,14 @@ } err: - kfree(device->cache.pkey_cache); - kfree(device->cache.gid_cache); + if (device->cache.pkey_cache) { + kfree(device->cache.pkey_cache); + device->cache.pkey_cache = NULL; + } + if (device->cache.gid_cache) { + kfree(device->cache.gid_cache); + device->cache.gid_cache = NULL; + } } void ib_cache_cleanup_one(struct ib_device *device) @@ -296,8 +302,12 @@ kfree(device->cache.gid_cache[p]); } - kfree(device->cache.pkey_cache); - kfree(device->cache.gid_cache); + if (device->cache.pkey_cache) { + kfree(device->cache.pkey_cache); + } + if (device->cache.gid_cache) { + kfree(device->cache.gid_cache); + } } struct ib_client cache_client = { From shaharf at voltaire.com Thu Nov 25 05:03:45 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 25 Nov 2004 15:03:45 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID: > > Sam> After giving it a second thought my vote goes for: > Sam> include/linux/infiniband > > Could you share the reasoning that led to that preference? > > Unfortunately we don't seem to be converging on one choice of location. > > On one side there is the fact that the .h files are not used outside > of drivers/infiniband -- hence they should stay under drivers/infiniband. > > On the other side is the fact that moving the includes under include/ > gets rid of some CFLAGS lines in the Makefile. > > I don't see a conclusive reason to choose any particular place. > Perhaps Linus or Andrew can simply hand down an authoritative answer? > > Thanks, > Roland (This message is posted to openib-general only) I agree that headers that are not used outside drivers/infiniband should stay there, but it seems that some header currently located in drivers/infiniband may be used by user mode programs - ib_user_mad.h for example, but also parts of ib_mad.h, ib_sa.h, etc. So there are two issues - 1 Shouldn't we move known public headers to include/linux/infiniband? 2 I would prefer to let user mode stuff include IB related headers such as ib_mad.h without including real kernel-only stuff. Can we separate these files to (user mode) public parts and kernel only (even drivers/infiniband only) parts? If we do that, I would say that the public headers should be located in include/linux/infiniband and leave the private headers where they are today. Of course we can also use the #ifdef __KERNEL__ mechanism, but personally I don't like it.
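(By the #ifdef mechanism I mean the usual __KERNEL__ guard, roughly:

	/* definitions shared with userspace */
	struct ib_user_mad_reg_req { ... };

	#ifdef __KERNEL__
	/* kernel-only declarations */
	#endif

in one header included from both sides -- the above is a sketch only.)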
Shahar From roland at topspin.com Thu Nov 25 07:50:30 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 25 Nov 2004 07:50:30 -0800 Subject: [openib-general] [PATCH] cache.c fixes In-Reply-To: <506C3D7B14CDD411A52C00025558DED605B1A4F9@mtlex01.yok.mtl.com> (Eli Cohen's message of "Thu, 25 Nov 2004 14:36:30 +0200") References: <506C3D7B14CDD411A52C00025558DED605B1A4F9@mtlex01.yok.mtl.com> Message-ID: <52oehmx709.fsf@topspin.com> Thanks for pointing this out. However your patch was seriously whitespace damaged. Also kfree(NULL) is perfectly fine and even encouraged for better readability. So I just applied the kmalloc() part of the fix by hand. Thanks, Roland From roland at topspin.com Thu Nov 25 07:51:59 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 25 Nov 2004 07:51:59 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: (shaharf@voltaire.com's message of "Thu, 25 Nov 2004 15:03:45 +0200") References: Message-ID: <52fz2yx6xs.fsf@topspin.com> shaharf> I agree that headers that are not used outside shaharf> drivers/infiniband should stay there, but it seems that shaharf> some header currently located in drivers/infiniband may shaharf> be used by user mode programs - ib_user_mad.h for shaharf> example, but also parts of ib_mad.h, ib_sa.h, etc. I believe the current feeling in the kernel community is that kernel headers should be kernel only and if userspace needs a header file, there should be a separate userspace version of the file. - Roland From shaharf at voltaire.com Thu Nov 25 08:37:19 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 25 Nov 2004 18:37:19 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID: > > I believe the current feeling in the kernel community is that kernel > headers should be kernel only and if userspace needs a header file, > there should be a separate userspace version of the file. > > - Roland OK, I accept that for most IB types and structures, but what about ib_user_mad structure & header? I suggest exposing that header to the user space (like many other files, asm/errno for example), or alternatively to use only well known structures for the device (I am not sure if that is feasible). Shahar From volta104 at mail.netvision.net.il Thu Nov 25 09:29:04 2004 From: volta104 at mail.netvision.net.il (volta104 at mail.netvision.net.il) Date: Thu, 25 Nov 2004 12:29:04 -0500 Subject: [openib-general] MAD registration for newer vendor classes Message-ID: <194470-220041142517294790@M2W103.mail2web.com> Hi, For the newer vendor classes (0x30-0x4f), should we add OUI to the registration and put the demux into the MAD layer for these classes by OUI ? If so, I will work up a patch for this. -- Hal From volta104 at mail.netvision.net.il Thu Nov 25 09:32:23 2004 From: volta104 at mail.netvision.net.il (volta104 at mail.netvision.net.il) Date: Thu, 25 Nov 2004 12:32:23 -0500 Subject: [openib-general] OUI Needed for OpenIB Alliance ? Message-ID: <52540-2200411425173223129@M2W056.mail2web.com> Hi, One more thing I forgot in the last post: It also seems like we might need an OpenIB alliance OUI for any vendor class MADs, etc. that we might define. Note that one use of this would be a vendor specific ping as a diagnostic tool. There are others.
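(For reference, classes 0x30-0x4f carry the OUI in the MAD itself; the vendor class format per the IB spec is roughly:

	struct ib_vendor_mad {
		struct ib_mad_hdr  mad_hdr;
		struct ib_rmpp_hdr rmpp_hdr;
		u8                 reserved;
		u8                 oui[3];      /* vendor OUI */
		u8                 data[216];
	};

struct and field names here are a sketch, not anything in the tree yet.)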
-- Hal From tduffy at sun.com Thu Nov 25 09:45:32 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 25 Nov 2004 09:45:32 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: References: Message-ID: <1101404732.1290.4.camel@duffman> On Thu, 2004-11-25 at 18:37 +0200, shaharf wrote: > > > > I believe the current feeling in the kernel community is that kernel > > headers should be kernel only and if userspace needs a header file, > > there should be a separate userspace version of the file. > > > > - Roland > > OK, I accept that for most IB types and structures, but what about > ib_user_mad structure & header? > > I suggest exposing that header to the user space (like many other files, > asm/errno for example), or alternatively to use only well known > structures for the device (I am not sure if that is feasible). Right, but that should be part of some *other* package (not the kernel). A copy of the file, like how glibc ships kernel headers separately. -tduffy From mst at mellanox.co.il Thu Nov 25 10:13:11 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 25 Nov 2004 20:13:11 +0200 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041124223552.GA15993@esmail.cup.hp.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> <20041124223552.GA15993@esmail.cup.hp.com> Message-ID: <20041125181311.GA18098@mellanox.co.il> Hello! Quoting r. Grant Grundler (iod00d at hp.com) "Re: [openib-general] HP ZX1 and HP IB cards...": > On Wed, Nov 24, 2004 at 09:16:24PM +0200, Michael S. Tsirkin wrote: > > > 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) > > > Subsystem: Mellanox Technology MT23108 InfiniHost > > > Flags: 66MHz, medium devsel, IRQ 66 > > > Memory at 00000000a0800000 (64-bit, non-prefetchable) [size=1M] > > > Memory at 00000000a0000000 (64-bit, prefetchable) [size=8M] > > > Memory at (64-bit, prefetchable) > > is the problem. > > > > System Firmware is having problems with this card. > > yeah - turned out to be a firmware bug...not clear the firmware > team will fix it but they are at least aware of it. > > We are also considering adding 64-bit MMIO (aka GMMIO) support > to ia64-linux. We just learned that some boxes don't assign GMMIO. Makes perfect sense to me. I always wondered why more systems don't do it. It is a real problem to allocate a 256Mbyte DDR BAR out of the 32-bit PCI space. MST From roland at topspin.com Thu Nov 25 10:25:48 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 25 Nov 2004 10:25:48 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: (shaharf@voltaire.com's message of "Thu, 25 Nov 2004 18:37:19 +0200") References: Message-ID: <528y8pyedv.fsf@topspin.com> shaharf> OK, I accept that for most IB types and structures, but shaharf> what about ib_user_mad structure & header? shaharf> I suggest exposing that header to the user space (like shaharf> many other files, asm/errno for example), or shaharf> alternatively to use only well known structures for the shaharf> device (I am not sure if that is feasible).
/usr/include/asm/errno.h does not come directly from the kernel. It is sanitized and packaged as part of glibc, and even this use is largely due to historical reasons. Adding more such dependencies on kernel headers is not the right way forward. It would make sense for OpenIB to ship a package like "libibmad" that has all the headers required for using the ib_umad module. Userspace and the kernel need to agree on the ABI, obviously, but physically sharing the same .h file ends up creating more problems than it solves. - Roland From shaharf at voltaire.com Sun Nov 28 01:13:44 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 28 Nov 2004 11:13:44 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID: > > It would make sense for OpenIB to ship a package like "libibmad" that > has all the headers required for using the ib_umad module. Userspace > and the kernel need to agree on the ABI, obviously, but physically > sharing the same .h file ends up creating more problems than it solves. > > - Roland OK. I am currently working on such libibmad and I will add the header file copy to it. Please take the usermode stuff into account when changing relevant header files. Shahar From halr at voltaire.com Sun Nov 28 07:38:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 28 Nov 2004 10:38:50 -0500 Subject: [openib-general] Re: [PATCH] cleanup/fixes for handle_outgoing_smp In-Reply-To: <41A52E8A.3000802@ichips.intel.com> References: <41A52E8A.3000802@ichips.intel.com> Message-ID: <1101656330.4145.9.camel@localhost.localdomain> On Wed, 2004-11-24 at 19:59, Sean Hefty wrote: > This patch restructures handle_outgoing_smp to improve its readability > and fixes the following issues: removes unneeded memory allocation for > received SMP, properly sends an SMP if the underlying HCA driver does not > provide a process_mad routine, and deallocates the allocated received > SMP in all failure cases. This patch was rejected. I'm not sure why. Can you regenerate it? -- Hal From roland at topspin.com Sun Nov 28 08:43:26 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 28 Nov 2004 08:43:26 -0800 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) In-Reply-To: (shaharf@voltaire.com's message of "Sun, 28 Nov 2004 11:13:44 +0200") References: Message-ID: <52d5xxx6tt.fsf@topspin.com> shaharf> OK. I am currently working on such libibmad and I will shaharf> add the header file copy to it. shaharf> Please take the usermode stuff into account when changing shaharf> relevant header files. The intention is that userspace can read the kernel's ABI version from /sys/class/infiniband_mad/abi_version and compare it to the value of IB_USER_MAD_ABI_VERSION that userspace was compiled with. We just need to be careful to increment the ABI version if we make any incompatible changes to ib_user_mad.h.
From shaharf at voltaire.com Mon Nov 29 02:33:03 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 29 Nov 2004 12:33:03 +0200 Subject: [openib-general] Re: [PATCH][RFC/v2][1/21] Add core InfiniBand support (public headers) Message-ID:
> > The intention is that userspace can read the kernel's ABI version from > /sys/class/infiniband_mad/abi_version and compare it to the value of > IB_USER_MAD_ABI_VERSION that userspace was compiled with. We just > need to be careful to increment the ABI version if we make any > incompatible changes to ib_user_mad.h. > > - Roland
I will implement such a check in the library. Thanks, Shahar
From Andras.Horvath at cern.ch Mon Nov 29 02:39:34 2004 From: Andras.Horvath at cern.ch (Andras.Horvath at cern.ch) Date: Mon, 29 Nov 2004 11:39:34 +0100 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041124191624.GA9404@mellanox.co.il> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> Message-ID: <20041129103934.GT2630@cern.ch>
Hello, Maybe related, maybe not: I also have an HP rx2600 and a Voltaire HCA, same kernel (2.6.10-rc2), but a different error (after modprobe mthca and a few minutes of delay). Please see below. Andras
ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0)
GSI 60 (level, low) -> CPU 1 (0x0100) vector 61
ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 61
ib_mthca 0000:81:00.0: Found bridge: Mellanox Technology MT23108 PCI Bridge (0000:80:01.0)
ib_mthca 0000:81:00.0: FW version 000100180000, max commands 1
ib_mthca 0000:81:00.0: FW size 6143 KB (start cfa00000, end cfffffff)
ib_mthca 0000:81:00.0: HCA memory size 131071 KB (start c8000000, end cfffffff)
ib_mthca 0000:81:00.0: Max QPs: 16777216, reserved QPs: 16, entry size: 256
ib_mthca 0000:81:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64
ib_mthca 0000:81:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64
ib_mthca 0000:81:00.0: reserved MPTs: 16, reserved MTTs: 16
ib_mthca 0000:81:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1
ib_mthca 0000:81:00.0: Max QP/MCG: 16777216, reserved MGMs: 0
ib_mthca 0000:81:00.0: Flags: 003f0337
ib_mthca 0000:81:00.0: profile[ 0]--10/20 @ 0x c8000000 (size 0x 4000000)
ib_mthca 0000:81:00.0: profile[ 1]-- 0/16 @ 0x cc000000 (size 0x 1000000)
ib_mthca 0000:81:00.0: profile[ 2]-- 7/18 @ 0x cd000000 (size 0x 800000)
ib_mthca 0000:81:00.0: profile[ 3]-- 9/17 @ 0x cd800000 (size 0x 800000)
ib_mthca 0000:81:00.0: profile[ 4]-- 3/16 @ 0x ce000000 (size 0x 400000)
ib_mthca 0000:81:00.0: profile[ 5]-- 4/16 @ 0x ce400000 (size 0x 200000)
ib_mthca 0000:81:00.0: profile[ 6]--12/15 @ 0x ce600000 (size 0x 100000)
ib_mthca 0000:81:00.0: profile[ 7]-- 8/13 @ 0x ce700000 (size 0x 80000)
ib_mthca 0000:81:00.0: profile[ 8]--11/ 7 @ 0x ce780000 (size 0x 1000)
ib_mthca 0000:81:00.0: profile[ 9]-- 6/ 5 @ 0x ce781000 (size 0x 800)
ib_mthca 0000:81:00.0: HCA memory: allocated 105990 KB/124928 KB (18938 KB free)
ib_mthca 0000:81:00.0: Allocated EQ 1 with 65536 entries
ib_mthca 0000:81:00.0: Allocated EQ 2 with 128 entries
ib_mthca 0000:81:00.0: Allocated EQ 3 with 128 entries
ib_mthca 0000:81:00.0: Setting mask 00000000000343fe for eqn 2
ib_mthca 0000:81:00.0: Setting mask 0000000000000400 for eqn 3
ib_mthca 0000:81:00.0: Failed to initialize queue pair table, aborting.
ib_mthca 0000:81:00.0: Clearing mask 00000000000343fe for eqn 2
ib_mthca 0000:81:00.0: Clearing mask 0000000000000400 for eqn 3
ib_mthca: probe of 0000:81:00.0 failed with error -16
80:01.0 PCI bridge: Mellanox Technology MT23108 PCI Bridge (rev a1) (prog-if 00 [Normal decode]) Flags: bus master, 66Mhz, medium devsel, latency 64 Bus: primary=80, secondary=81, subordinate=81, sec-latency=64 Memory behind bridge: c8000000-d08fffff Capabilities: [70] PCI-X non-bridge device.
81:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66Mhz, medium devsel, IRQ 61 Memory at 00000000d0800000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000d0000000 (64-bit, prefetchable) [size=8M] Memory at 00000000c8000000 (64-bit, prefetchable) [size=128M] Capabilities: [40] #0d [001f] Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device.
From mst at mellanox.co.il Mon Nov 29 05:43:44 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 15:43:44 +0200 Subject: [openib-general] struct class and device.c Message-ID: <20041129134344.GA2991@mellanox.co.il>
Hi! Why doesn't core/device.c use struct class to manage the list of IB devices? Are there disadvantages with this approach? mst
From roland at topspin.com Mon Nov 29 08:36:19 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 08:36:19 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041129103934.GT2630@cern.ch> (Andras Horvath's message of "Mon, 29 Nov 2004 11:39:34 +0100") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041124191624.GA9404@mellanox.co.il> <20041129103934.GT2630@cern.ch> Message-ID: <52zn10vcho.fsf@topspin.com>
Andras> Hello, Maybe related, maybe not: I also have an HP rx2600 Andras> and a Voltaire HCA, same kernel (2.6.10-rc2), but a Andras> different error (after modprobe mthca and a few minutes of Andras> delay). Please see below.
This looks like an interrupt routing problem.
Andras> ib_mthca 0000:81:00.0: Failed to initialize queue pair table, aborting. Andras> ib_mthca 0000:81:00.0: Clearing mask 00000000000343fe for eqn 2 Andras> ib_mthca 0000:81:00.0: Clearing mask 0000000000000400 for eqn 3 Andras> ib_mthca: probe of 0000:81:00.0 failed with error -16
Initializing the QP table is the first time the driver tries to execute a FW command and get a completion interrupt. It seems the driver never sees the interrupt and eventually times out (should be a 1 minute timeout). Do other drivers work on this system? With this kernel? If so, what IRQ do they assign to the HCA (it's shown in /proc/interrupts after the driver is loaded). Thanks, Roland
From roland at topspin.com Mon Nov 29 08:37:49 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 08:37:49 -0800 Subject: [openib-general] struct class and device.c In-Reply-To: <20041129134344.GA2991@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Nov 2004 15:43:44 +0200") References: <20041129134344.GA2991@mellanox.co.il> Message-ID: <52vfbovcf6.fsf@topspin.com>
Michael> Hi! Why doesn't core/device.c use struct class to Michael> manage the list of IB devices? Are there disadvantages Michael> with this approach?
struct class is used. The code is in core/sysfs.c - Roland
From ido at mellanox.co.il Mon Nov 29 09:03:12 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Mon, 29 Nov 2004 19:03:12 +0200 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. Message-ID: <91DB792C7985D411BEC300B40080D29C711B95@mtvex01.mtv.mtl.com>
As far as I understand, if the kernel calls "ipoib_neigh_destructor" after the ib_ipoib module has been taken down, a kernel oops can occur. In most cases when a driver is taken down, the kernel cleanup has already destroyed all the relevant ipoib driver entries.
We noticed that while applications such as NetPerf are running while ipoib is taken down, the neighbor entry may be held (by the kernel) after the module is taken down. The destructor will only be called way after the application exits or is terminated. In such a case the kernel will call the destructor method after the module is already down, resulting in a kernel oops.
I am not 100% sure about that, but this is what I am seeing happening when I take the ipoib module down while netperf is running. Am I right? And if so, what can be done? I thought that maybe we could change the neighbor destructor pointer to NULL when the module exits. -Ido
Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game
From ido at mellanox.co.il Mon Nov 29 09:10:45 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Mon, 29 Nov 2004 19:10:45 +0200 Subject: [openib-general] Unicast ARP Message-ID: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com>
In gen2 we have a problem if the destination path (e.g. LID) changes. Typically the kernel periodically issues unicast ARP packets to addresses which are in the ARP cache to ensure that the neighbors are up and that the ARP cache is up to date. In gen2, unicast ARPs arrive at the ipoib driver (hard_xmit) with no address handle assigned. Now, an SA query is initiated by the ipoib driver, then ARP is sent to this address, and when ARP response arrives back, ARP cache is not updated with the new path. In other words, the ARP cache believes that the relevant entry is up to date and doesn't notice the path change.
I think that we should hold a linked list which contains the current address handles with the corresponding GID. When a unicast ARP is sent (hard_xmit), instead of going to the SA, look up the right GID in our list, then send a unicast ARP with this address handle. If the path has changed, then the ARP reply won't arrive and it will cause the ARP cache to be refreshed.
This solution also reduces the burden on the SA. What do you think ? -Ido
Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game
From halr at voltaire.com Mon Nov 29 09:02:49 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 12:02:49 -0500 Subject: [openib-general] [RFC] [PATCH] mad: Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU In-Reply-To: <419E64C7.9030905@ichips.intel.com> References: <1100895525.4136.11.camel@localhost.localdomain> <419E64C7.9030905@ichips.intel.com> Message-ID: <1101747769.4145.179.camel@localhost.localdomain>
On Fri, 2004-11-19 at 16:25, Sean Hefty wrote: > Hal Rosenstock wrote: > > Change mad thread model to be 1 thread/port rather than 1 thread/port/CPU > > (Note that I have not applied this but am requesting comments). > > > > Index: mad.c > > =================================================================== > > --- mad.c (revision 1269) > > +++ mad.c (working copy) > > @@ -1900,7 +1900,7 @@ > > goto error7; > > > > snprintf(name, sizeof name, "ib_mad%d", port_num); > > - port_priv->wq = create_workqueue(name); > > + port_priv->wq = create_singlethread_workqueue(name); > > if (!port_priv->wq) { > > ret = -ENOMEM; > > goto error8; > > My guess is that this is probably preferable to having 1/port/CPU, > especially on larger systems. It would depend on what the clients do > when notified of a completion.
I too think we should change this to a single threaded workqueue (and will do so shortly). > I guess one advantage of keeping it 1/port/CPU (for now) is that it > would help test multi-threaded support. One can always change it back for testing purposes but this means there is not as much "automatic" testing by default. -- Hal
From roland at topspin.com Mon Nov 29 09:33:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 09:33:05 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <91DB792C7985D411BEC300B40080D29C711B95@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Mon, 29 Nov 2004 19:03:12 +0200") References: <91DB792C7985D411BEC300B40080D29C711B95@mtvex01.mtv.mtl.com> Message-ID: <521xecv9v2.fsf@topspin.com>
Ido> As far as I understand, if the kernel calls Ido> "ipoib_neigh_destructor" after the ib_ipoib module has been Ido> taken down, a kernel oops can occur. In most cases when a Ido> driver is taken down, the kernel cleanup has already Ido> destroyed all the relevant ipoib driver entries. We noticed Ido> that while applications such as NetPerf are running while Ido> ipoib is taken down, the neighbor entry may be held (by the Ido> kernel) after the module is taken down. The destructor will Ido> only be called way after the application exits or is Ido> terminated.
This seems quite likely. I didn't check this very carefully -- it looked like the kernel killed all neighbours when an interface is unregistered, but I guess when the neighbours are still in use they can hang around for a while.
It does seem like IPoIB needs to keep track of all the neighbour structures it has added path context to. When unregistering an interface, IPoIB should free the path context and reset the destructor. However I'm not sure exactly how to do this while making sure the kernel hasn't freed the neighbour first (it seems there are some tricky races here). - Roland
From roland at topspin.com Mon Nov 29 09:34:42 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 09:34:42 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Mon, 29 Nov 2004 19:10:45 +0200") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> Message-ID: <52wtw4tv7x.fsf@topspin.com>
Ido> I think that we should hold a linked list which contains the Ido> current address handles with the corresponding GID. When a Ido> unicast ARP is sent (hard_xmit), instead of going to the SA, Ido> look up the right GID in our list, then send a unicast ARP Ido> with this address handle. If the path has changed, then the ARP Ido> reply won't arrive and it will cause the ARP cache to be Ido> refreshed.
This makes sense, although I would use an rb_tree indexed by GID rather than a linked list. Another possibility would be to perform the SA lookup for unicast ARPs and then update any neighbour path information if the reply is different from what we have stored. - Roland
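(For illustration, Roland's rb_tree suggestion might look roughly like the following sketch against the 2.6 rbtree API. All structure and function names here are invented; locking, reference counting, duplicate-key handling, and eviction are omitted.)

#include <linux/rbtree.h>
#include <linux/string.h>
#include <ib_verbs.h>	/* union ib_gid, struct ib_ah (path assumed) */

struct ipoib_ah_cache_entry {
	struct rb_node	node;
	union ib_gid	gid;	/* lookup key */
	struct ib_ah   *ah;	/* cached address handle for this GID */
};

static struct ipoib_ah_cache_entry *ah_cache_find(struct rb_root *root,
						  union ib_gid *gid)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct ipoib_ah_cache_entry *e =
			rb_entry(n, struct ipoib_ah_cache_entry, node);
		int cmp = memcmp(gid, &e->gid, sizeof *gid);

		if (cmp < 0)
			n = n->rb_left;
		else if (cmp > 0)
			n = n->rb_right;
		else
			return e;
	}
	return NULL;
}

static void ah_cache_insert(struct rb_root *root,
			    struct ipoib_ah_cache_entry *e)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct ipoib_ah_cache_entry *cur;

		parent = *p;
		cur = rb_entry(parent, struct ipoib_ah_cache_entry, node);
		if (memcmp(&e->gid, &cur->gid, sizeof e->gid) < 0)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&e->node, parent, p);
	rb_insert_color(&e->node, root);
}

(On a path change the cached ah for that GID would simply be replaced, which is one way to address the stale-path concern raised above.)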
From ftillier at infiniconsys.com Mon Nov 29 09:37:11 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 29 Nov 2004 09:37:11 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <521xecv9v2.fsf@topspin.com> Message-ID: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com>
> From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 9:33 AM > > It does seem like IPoIB needs to keep track of all the neighbour > structures it has added path context to. When unregistering an > interface, IPoIB should free the path context and reset the > destructor. However I'm not sure exactly how to do this while making > sure the kernel hasn't freed the neighbour first (it seems there are > some tricky races here). >
Maybe I'm clueless (quite likely), but why not just have each neighbour structure take a reference on the module when it is created? The destructor would release that reference. That should solve races involved with cleaning up, as well as ensuring that the module is still around for the destructor to get invoked. - Fab
From halr at voltaire.com Mon Nov 29 09:33:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 12:33:39 -0500 Subject: [openib-general] Unicast ARP In-Reply-To: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> Message-ID: <1101749619.4145.214.camel@localhost.localdomain>
On Mon, 2004-11-29 at 12:10, Ido Bukspan wrote: > In gen2 we have a problem if the destination path (e.g. LID) > changes.
Is this restricted to LID changes or apply more generally to hardware address (GID and/or QPN) changes ? (I had seen something similar when the QPN changed; ARP timeout/retries are needed prior to connectivity being restored).
> Typically the kernel periodically issues unicast ARP packets to > addresses which are in the ARP cache to ensure that the neighbors are up and > that the ARP cache is up to date. In gen2, unicast ARPs arrive at the ipoib > driver (hard_xmit) with no address handle assigned. Now, an SA query is > initiated by the ipoib driver,
Is this the PathRecord lookup for unicast GID of destination ? Is the path record (info) cached ?
> then ARP is sent to this address, and when > ARP response arrives back, ARP cache is not updated with the new path.
Do you mean hardware address here (GID + QPN) ?
> In > other words, the ARP cache believes that the relevant entry is up to date and > doesn't notice the path change. > > I think that we should hold a linked list which contains the current > address handles with the corresponding GID. When a unicast ARP is sent > (hard_xmit), instead of going to the SA, look up the right GID in our list, > then send a unicast ARP with this address handle. If the path has changed, then > the ARP reply won't arrive and it will cause the ARP cache to be refreshed. > > This solution also reduces the burden on the SA. > > What do you think ? > > -Ido > > > > Ido Bukspan > Mellanox Technologies Ltd. > Phone : (972)-3-6259500 ,Ext 518. > Fax : (972)-3-5614943 > mailto:ido at mellanox.co.il > http://www.mellanox.com > > No play No game > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From roland at topspin.com Mon Nov 29 09:43:12 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 09:43:12 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down.
In-Reply-To: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 09:37:11 -0800") References: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> Message-ID: <52ekictutr.fsf@topspin.com> Fab> Maybe I'm clueless (quite likely), but why not just have each Fab> neighbour structure take a reference on the module when it is Fab> created? The destructor would release that reference. That Fab> should solve races involved with cleaning up, as well as Fab> ensuring that the module is still around for the destructor Fab> to get invoked. That's a possibility but then no IPoIB neighbour structures could ever be garbage collected, which doesn't seem like a good idea. - Roland From halr at voltaire.com Mon Nov 29 09:39:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 12:39:05 -0500 Subject: [openib-general] Unicast ARP In-Reply-To: <52wtw4tv7x.fsf@topspin.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <52wtw4tv7x.fsf@topspin.com> Message-ID: <1101749945.4145.219.camel@localhost.localdomain> On Mon, 2004-11-29 at 12:34, Roland Dreier wrote: > This makes sense, although I would use an rb_tree indexed by GID > rather than a linked list. Is GID sufficient or is QPN also needed ? -- Hal From roland at topspin.com Mon Nov 29 10:00:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:00:29 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <1101749945.4145.219.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 29 Nov 2004 12:39:05 -0500") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <52wtw4tv7x.fsf@topspin.com> <1101749945.4145.219.camel@localhost.localdomain> Message-ID: <52act0tu0y.fsf@topspin.com> Hal> Is GID sufficient or is QPN also needed ? You're right. It should be indexed by HW address. - R. From roland at topspin.com Mon Nov 29 10:03:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:03:04 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <1101749619.4145.214.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 29 Nov 2004 12:33:39 -0500") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> Message-ID: <52653ottwn.fsf@topspin.com> Hal> Is this restricted to LID changes or apply more generally to Hal> hardware address (GID and/or QPN) changes ? (I had seen Hal> something similar when the QPN changed; ARP timeout/retries Hal> are needed prior to connectivity being restored). Just LID changes. If GID or QPN changes, then the HW address is different and the kernel neighbour code can notice. It's just like an IP address moving to a different MAC on ethernet: either the old ARP entry has to time out, or the interface with the new address has to send a gratuitous ARP. Hal> Is this the PathRecord lookup for unicast GID of destination Hal> ? Is the path record (info) cached ? Yes, path lookup for unicast GID. It's not cached, which is what causes the issue: the ARP will get a reply and so the kernel will believe the neighbour is still valid, even though it has an obsolete path in it. Ido> then ARP is sent to this address, and when ARP response Ido> arrives back, ARP cache is not updated with the new path. Hal> Do you mean hardware address here (GID + QPN) ? No, he meant path (LID, SL, etc) - R. 
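(To make the LID-versus-hardware-address distinction above concrete: the 20-octet IPoIB link-layer address from the IPoIB draft carries the QPN and GID but not the LID, which lives in the path record. The layout sketch below uses our own struct and field names, not the driver's.)

#include <linux/types.h>

/* 20-octet IPoIB hardware address as carried in ARP payloads and
 * neighbour entries: the QPN and port GID are part of the address,
 * the LID is not, so a LID-only change leaves the kernel's neighbour
 * cache looking valid. */
struct ipoib_hw_addr {
	u8 reserved;	/* reserved in the draft */
	u8 qpn[3];	/* destination QPN, network byte order */
	u8 gid[16];	/* destination port GID */
} __attribute__ ((packed));

static inline u32 ipoib_hw_addr_qpn(const struct ipoib_hw_addr *addr)
{
	return (addr->qpn[0] << 16) | (addr->qpn[1] << 8) | addr->qpn[2];
}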
From roland at topspin.com Mon Nov 29 10:03:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:03:57 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 09:37:11 -0800") References: <000001c4d63a$0f830af0$655aa8c0@infiniconsys.com> Message-ID: <521xecttv6.fsf@topspin.com>
Fab> Maybe I'm clueless (quite likely), but why not just have each Fab> neighbour structure take a reference on the module when it is Fab> created? The destructor would release that reference. That Fab> should solve races involved with cleaning up, as well as Fab> ensuring that the module is still around for the destructor Fab> to get invoked.
Sorry, I think I read this backwards before. But if each neighbour has a reference on the IPoIB module then it will be nearly impossible to unload the IPoIB module (which solves the problem but again not in a very nice way). - Roland
From ftillier at infiniconsys.com Mon Nov 29 10:07:06 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 29 Nov 2004 10:07:06 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <52ekictutr.fsf@topspin.com> Message-ID: <000101c4d63e$3d14db70$655aa8c0@infiniconsys.com>
> From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 9:43 AM > > Fab> Maybe I'm clueless (quite likely), but why not just have each > Fab> neighbour structure take a reference on the module when it is > Fab> created? The destructor would release that reference. That > Fab> should solve races involved with cleaning up, as well as > Fab> ensuring that the module is still around for the destructor > Fab> to get invoked. > > That's a possibility but then no IPoIB neighbour structures could ever > be garbage collected, which doesn't seem like a good idea.
I'm confused - why not? Why couldn't garbage collection invoke a similar code path to the destructor? - Fab
From roland at topspin.com Mon Nov 29 10:12:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:12:54 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <000101c4d63e$3d14db70$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 10:07:06 -0800") References: <000101c4d63e$3d14db70$655aa8c0@infiniconsys.com> Message-ID: <52wtw4sevt.fsf@topspin.com>
Fab> I'm confused - why not? Why couldn't garbage collection Fab> invoke a similar code path to the destructor?
I was confused -- I thought you meant for the module to take a reference on each neighbour (which would prevent it from being destroyed until the module released it). On the other hand, as I said, if the module can't be unloaded until there are no neighbours left, this could make it very difficult to unload the module. - Roland
From ftillier at infiniconsys.com Mon Nov 29 10:13:39 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Mon, 29 Nov 2004 10:13:39 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down.
In-Reply-To: <521xecttv6.fsf@topspin.com> Message-ID: <000201c4d63f$27aa4350$655aa8c0@infiniconsys.com>
> From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 10:04 AM > > Fab> Maybe I'm clueless (quite likely), but why not just have each > Fab> neighbour structure take a reference on the module when it is > Fab> created? The destructor would release that reference. That > Fab> should solve races involved with cleaning up, as well as > Fab> ensuring that the module is still around for the destructor > Fab> to get invoked. > > Sorry, I think I read this backwards before. But if each neighbour > has a reference on the IPoIB module then it will be nearly impossible > to unload the IPoIB module (which solves the problem but again not in > a very nice way). >
Don't the neighbour structures get cleaned up when an interface goes down? Can't the interface go down while the module has outstanding references? I probably just don't know enough about how module unload works. - Fab
From roland at topspin.com Mon Nov 29 10:16:40 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 10:16:40 -0800 Subject: [openib-general] possible oops when calling ipoib_neigh_destructor while ipoib module is down. In-Reply-To: <000201c4d63f$27aa4350$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 29 Nov 2004 10:13:39 -0800") References: <000201c4d63f$27aa4350$655aa8c0@infiniconsys.com> Message-ID: <52sm6ssepj.fsf@topspin.com>
Fab> Don't the neighbour structures get cleaned up when an Fab> interface goes down? Can't the interface go down while the Fab> module has outstanding references?
The original issue was that neighbour structures can hang around after an interface is unregistered. An interface can be downed even if the module has a non-zero reference count, of course. But under 2.6, for a normal network interface, one can rmmod the corresponding module at any time -- the interface is brought down and the module is unloaded. I'd prefer to preserve that, rather than forcing the user to down all IPoIB interfaces and then wait an arbitrarily long time for all neighbours to be freed before unloading the IPoIB module. - Roland
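(In code terms, Fab's suggestion is the standard module refcounting pattern sketched below; the function names are illustrative. Roland's objection is that rmmod ib_ipoib would then fail until the last such neighbour is garbage collected.)

#include <linux/module.h>
#include <linux/errno.h>
#include <net/neighbour.h>

/* Pin the module for each neighbour that carries IPoIB path context. */
static int ipoib_neigh_add_path(struct neighbour *neigh)
{
	if (!try_module_get(THIS_MODULE))
		return -ENODEV;	/* module is already on its way out */
	/* ... allocate path context and hang it off the neighbour ... */
	return 0;
}

/* Called by the core neighbour code, possibly long after ifdown. */
static void ipoib_neigh_destructor(struct neighbour *neigh)
{
	/* ... free the attached path context ... */
	module_put(THIS_MODULE);
}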
From mshefty at ichips.intel.com Mon Nov 29 10:51:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 10:51:31 -0800 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <194470-220041142517294790@M2W103.mail2web.com> References: <194470-220041142517294790@M2W103.mail2web.com> Message-ID: <41AB6FB3.4040301@ichips.intel.com>
volta104 at mail.netvision.net.il wrote: > Hi, > > For the newer vendor classes (0x30-0x4f), should we add OUI to the > registration and put the demux into the MAD layer for these classes by OUI ? > > If so, I will work up a patch for this.
I guess I need to re-examine the MAD dispatching, but I can't think of a reason why it wouldn't already support vendor classes. - Sean
From mst at mellanox.co.il Mon Nov 29 10:57:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 20:57:39 +0200 Subject: [openib-general] Unicast ARP In-Reply-To: <52653ottwn.fsf@topspin.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> Message-ID: <20041129185739.GA3394@mellanox.co.il>
Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Unicast ARP": > Hal> Is this restricted to LID changes or apply more generally to > Hal> hardware address (GID and/or QPN) changes ? (I had seen > Hal> something similar when the QPN changed; ARP timeout/retries > Hal> are needed prior to connectivity being restored). > > Just LID changes. If GID or QPN changes, then the HW address is > different and the kernel neighbour code can notice. It's just like an > IP address moving to a different MAC on ethernet: either the old ARP > entry has to time out, or the interface with the new address has to > send a gratuitous ARP.
Currently it also seems that by just bringing the interface up and down the hw address will change. Unless I am doing something wrong, this inconvenience seems to be caused by a different QP number being assigned. Can't this be solved e.g. by assigning a fixed QP number for IP over IB? Thanks, MST
From roland at topspin.com Mon Nov 29 11:04:14 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 11:04:14 -0800 Subject: [openib-general] Unicast ARP In-Reply-To: <20041129185739.GA3394@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Nov 2004 20:57:39 +0200") References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> <20041129185739.GA3394@mellanox.co.il> Message-ID: <52oehgsci9.fsf@topspin.com>
Michael> Currently it also seems that by just bringing the Michael> interface up and down the hw address will change. Unless Michael> I am doing something wrong, this inconvenience seems to Michael> be caused by a different QP number being assigned. Can't Michael> this be solved e.g. by assigning a fixed QP number for IP Michael> over IB?
It seems you are doing something wrong. The QP is allocated when the interface is created and remains the same when the interface is brought up and down:
# ip addr show dev ib0 7: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff # ifconfig ib0 down # ifconfig ib0 up # ip addr show dev ib0 7: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff - Roland
From halr at voltaire.com Mon Nov 29 10:59:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 13:59:35 -0500 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <41AB6FB3.4040301@ichips.intel.com> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> Message-ID: <1101754775.4145.239.camel@localhost.localdomain>
On Mon, 2004-11-29 at 13:51, Sean Hefty wrote: > volta104 at mail.netvision.net.il wrote: > > > Hi, > > > > For the newer vendor classes (0x30-0x4f), should we add OUI to the > > registration and put the demux into the MAD layer for these classes by OUI ? > > > > If so, I will work up a patch for this. > > I guess I need to re-examine the MAD dispatching, but I can't think of > a reason why it wouldn't already support vendor classes.
There is a new range of vendor classes (at IBA 1.1) which embed the OUI so that multiple vendors can share the same class. There needs to be another level of demux for these. -- Hal
From mst at mellanox.co.il Mon Nov 29 11:17:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 21:17:51 +0200 Subject: [openib-general] struct class and device.c In-Reply-To: <52vfbovcf6.fsf@topspin.com> References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> Message-ID: <20041129191751.GA3450@mellanox.co.il>
Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] struct class and device.c": > Michael> Hi! Why doesn't core/device.c use struct class to > Michael> manage the list of IB devices? Are there disadvantages > Michael> with this approach? > > struct class is used. The code is in core/sysfs.c >
No, I had in mind using the children list in the class instead of the device_list in device.c and interfaces list instead of the client_list. Why not? Wouldn't that work? MST
From mshefty at ichips.intel.com Mon Nov 29 11:19:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:19:30 -0800 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <1101754775.4145.239.camel@localhost.localdomain> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> Message-ID: <41AB7642.2050803@ichips.intel.com>
Hal Rosenstock wrote: >>>For the newer vendor classes (0x30-0x4f), should we add OUI to the >>>registration and put the demux into the MAD layer for these classes by OUI ? >>> >>>If so, I will work up a patch for this. >> >>I guess I need to re-examine the MAD dispatching, but I can't think of >>a reason why it wouldn't already support vendor classes. > > > There is a new range of vendor classes (at IBA 1.1) which embed the OUI > so that multiple vendors can share the same class. There needs to be > another level of demux for these.
You know, I'd really like to rant about the entire MAD architecture right about now... I think it makes sense to add OUI to the MAD interface.
From halr at voltaire.com Mon Nov 29 11:16:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 14:16:22 -0500 Subject: [openib-general] Unicast ARP In-Reply-To: <52oehgsci9.fsf@topspin.com> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> <20041129185739.GA3394@mellanox.co.il> <52oehgsci9.fsf@topspin.com> Message-ID: <1101755782.4145.256.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:04, Roland Dreier wrote: > Michael> Currently it also seems that by just bringing the > Michael> interface up and down the hw address will change. Unless > Michael> I am doing something wrong, this inconvenience seems to > Michael> be caused by a different QP number being assigned. Can't > Michael> this be solved e.g. by assigning a fixed QP number for IP > Michael> over IB? > > It seems you are doing something wrong. The QP is allocated when the > interface is created and remains the same when the interface is > brought up and down:
The only time I see that is when removing and re-adding the IPoIB module. (That does not mean I am recommending a fixed QP for IPoIB (at least unconnected mode).) -- Hal
From roland at topspin.com Mon Nov 29 11:24:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 29 Nov 2004 11:24:34 -0800 Subject: [openib-general] struct class and device.c In-Reply-To: <20041129191751.GA3450@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Nov 2004 21:17:51 +0200") References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> <20041129191751.GA3450@mellanox.co.il> Message-ID: <52k6s4sbkd.fsf@topspin.com>
Michael> No, I had in mind using the children list in the class Michael> instead of the device_list in device.c and interfaces Michael> list instead of the client_list. Why not? Wouldn't that Michael> work?
I guess it would work but it seems ugly to rely on the internals of struct class. Although drivers/ieee1394 does do exactly this (without any locking).... - R.
From halr at voltaire.com Mon Nov 29 11:23:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 14:23:26 -0500 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <41AB7642.2050803@ichips.intel.com> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> <41AB7642.2050803@ichips.intel.com> Message-ID: <1101756206.4145.265.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:19, Sean Hefty wrote: > > There is a new range of vendor classes (at IBA 1.1) which embed the OUI > > so that multiple vendors can share the same class. There needs to be > > another level of demux for these. > > You know, I'd really like to rant about the entire MAD architecture > right about now...
I was there for this one so you can vent at me :-)
> I think it makes sense to add OUI to the MAD interface.
I will work something up on this and post to the list when it is "ready". Also, based on this, do you think it makes sense for an OpenIB OUI (if we are to utilize these classes) ? -- Hal
From mshefty at ichips.intel.com Mon Nov 29 11:30:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:30:20 -0800 Subject: [openib-general] Re: [PATCH] cleanup/fixes for handle_outgoing_smp In-Reply-To: <1101656644.4145.15.camel@localhost.localdomain> References: <41A52E8A.3000802@ichips.intel.com> <1101656644.4145.15.camel@localhost.localdomain> Message-ID: <41AB78CC.2000006@ichips.intel.com>
Hal Rosenstock wrote: >>This patch restructures handle_outgoing_smp to improve its readability > > I can't see for sure from your patch.
The main changes are that the code is outdented and moved from nested if's to a switch statement.
>>and fixes the following issues: removes unneeded memory allocation for >>received SMP, > > It looks like the allocation strategy is slightly modified.
It was. The allocation is not done unless process_mad will be called.
>>properly sends an SMP if the underlying HCA driver does not >>provide a process_mad routine, > > Missed setting the return code here.
I believe that the original code would call the agent's send_handler if process_mad was not provided.
> >>and deallocates the allocated received >>SMP in all failure cases. > > > What failure case did not deallocate the allocated received SMP ?
Don't recall exactly. Might have been if process_mad consumed the MAD, which I guess isn't a failure case. I can regenerate a new patch.
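(For background on the OUI demux question above: in the IBA 1.1 vendor classes 0x30-0x4f the OUI is carried inside the MAD itself, roughly as in the sketch below. The struct and field names are ours, not from the OpenIB tree, and wire byte order is glossed over.)

#include <linux/types.h>
#include <linux/string.h>

struct ib_vendor_mad {
	/* common MAD header (24 bytes) */
	u8	base_version;
	u8	mgmt_class;	/* 0x30 - 0x4f for OUI-based classes */
	u8	class_version;
	u8	method;
	u16	status;		/* big endian on the wire */
	u16	class_specific;
	u64	tid;
	u16	attr_id;
	u16	resv;
	u32	attr_mod;
	/* class 0x30-0x4f extension */
	u8	rmpp[12];	/* RMPP header */
	u8	reserved;
	u8	oui[3];		/* IEEE OUI of the defining vendor */
	u8	data[216];
} __attribute__ ((packed));

/* Demux sketch: registration would match on (class, OUI), not class alone. */
static inline int ib_vendor_mad_matches(const struct ib_vendor_mad *mad,
					u8 mgmt_class, const u8 oui[3])
{
	return mad->mgmt_class == mgmt_class && !memcmp(mad->oui, oui, 3);
}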
From mshefty at ichips.intel.com Mon Nov 29 11:40:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:40:54 -0800 Subject: [openib-general] [PATCH] [re-send] cleanup/fixes for handle_outgoing_smp Message-ID: <20041129114054.2afd03b4.mshefty@ichips.intel.com> Index: core/mad.c =================================================================== --- core/mad.c (revision 1291) +++ core/mad.c (working copy) @@ -366,108 +366,93 @@ struct ib_send_wr *send_wr) { int ret; + struct ib_mad_private *mad_priv; + struct ib_mad_send_wc mad_send_wc; if (!smi_handle_dr_smp_send(smp, mad_agent->device->node_type, mad_agent->port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); - goto error1; + goto out; } - if (smi_check_local_dr_smp(smp, - mad_agent->device, - mad_agent->port_num)) { - struct ib_mad_private *mad_priv; - struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_send_wc mad_send_wc; - - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); - if (!mad_priv) { - ret = -ENOMEM; - printk(KERN_ERR PFX "No memory for local " - "response MAD\n"); - goto error1; - } + /* Check to post send on QP or process locally. */ + ret = smi_check_local_dr_smp(smp, mad_agent->device, + mad_agent->port_num); + if (!ret || !mad_agent->device->process_mad) + goto out; - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); - - if (mad_agent->device->process_mad) { - ret = mad_agent->device->process_mad( - mad_agent->device, - 0, - mad_agent->port_num, - smp->dr_slid, /* ? */ + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto out; + } + ret = mad_agent->device->process_mad(mad_agent->device, 0, + mad_agent->port_num, smp->dr_slid, (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - if (ret & IB_MAD_RESULT_SUCCESS) { - if (ret & IB_MAD_RESULT_CONSUMED) { - ret = 1; - goto error1; - } - if (ret & IB_MAD_RESULT_REPLY) { - /* - * See if response is solicited and - * there is a recv handler - */ - if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; - - /* - * Defined behavior is to - * complete response before - * request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = 0; /* IB_QPT_SMI ? 
*/ - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = - sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = - &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; - mad_agent_priv->agent.recv_handler( - mad_agent, - &mad_priv->header.recv_wc); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } else - kmem_cache_free(ib_mad_cache, mad_priv); - } - - if (mad_agent_priv->agent.send_handler) { - /* Now, complete send */ - mad_send_wc.status = IB_WC_SUCCESS; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = send_wr->wr_id; - mad_agent_priv->agent.send_handler( - mad_agent, - &mad_send_wc); - ret = 1; + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent->recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to complete response before + * request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); + mad_priv->header.recv_buf.grh = NULL; + mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; + mad_priv->header.recv_wc.recv_buf = + &mad_priv->header.recv_buf; + mad_agent->recv_handler(mad_agent, + &mad_priv->header.recv_wc); } else - ret = -EINVAL; - } else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: ret = 0; + kmem_cache_free(ib_mad_cache, mad_priv); + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = -EINVAL; + goto out; + } -error1: + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + mad_agent->send_handler(mad_agent, &mad_send_wc); + ret = 1; +out: return ret; } From mshefty at ichips.intel.com Mon Nov 29 11:48:57 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 11:48:57 -0800 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <1101756206.4145.265.camel@localhost.localdomain> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> <41AB7642.2050803@ichips.intel.com> <1101756206.4145.265.camel@localhost.localdomain> Message-ID: <41AB7D29.409@ichips.intel.com> Hal Rosenstock wrote: > Also, based on this, do you think it makes sense for an OpenIB OUI (if > we are to utilize these classes) ? I think that it makes sense, but I'd wait until we actually have code that utilizes it. - Sean From mst at mellanox.co.il Mon Nov 29 11:55:15 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 29 Nov 2004 21:55:15 +0200 Subject: [openib-general] struct class and device.c In-Reply-To: <52k6s4sbkd.fsf@topspin.com> References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> <20041129191751.GA3450@mellanox.co.il> <52k6s4sbkd.fsf@topspin.com> Message-ID: <20041129195515.GA3514@mellanox.co.il>
Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] struct class and device.c": > Michael> No, I had in mind using the children list in the class > Michael> instead of the device_list in device.c and interfaces > Michael> list instead of the client_list. Why not? Wouldn't that > Michael> work? > > I guess it would work but it seems ugly to rely on the internals of > struct class. Although drivers/ieee1394 does do exactly this (without > any locking)....
Yes. Further, that's just for the alloc_name hacks. I am not sure why that is useful - this creates arbitrary names like mthca0 which are not really useful to identify the devices. Couldn't, say, PCI bus names be used?
From mst at mellanox.co.il Mon Nov 29 12:02:07 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Nov 2004 22:02:07 +0200 Subject: [openib-general] struct class and device.c In-Reply-To: <20041129195515.GA3514@mellanox.co.il> References: <20041129134344.GA2991@mellanox.co.il> <52vfbovcf6.fsf@topspin.com> <20041129191751.GA3450@mellanox.co.il> <52k6s4sbkd.fsf@topspin.com> <20041129195515.GA3514@mellanox.co.il> Message-ID: <20041129200207.GB3514@mellanox.co.il>
Hello! Sorry about replying to myself ... Quoting Michael S. Tsirkin (mst at mellanox.co.il) "Re: [openib-general] struct class and device.c": > Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] struct class and device.c": > > Michael> No, I had in mind using the children list in the class > > Michael> instead of the device_list in device.c and interfaces > > Michael> list instead of the client_list. Why not? Wouldn't that > > Michael> work? > > > > I guess it would work but it seems ugly to rely on the internals of > > struct class. Although drivers/ieee1394 does do exactly this (without > > any locking).... > > Yes.
I mean, duplicating the code from drivers/base/ is ugly too.
> Further, that's just for the alloc_name hacks. I am not sure why > that is useful - this creates arbitrary names like mthca0 > which are not really useful to identify the devices. > Couldn't, say, PCI bus names be used?
Or use the system guid. As it is you can pull a device out and another device will be renamed. mst
From halr at voltaire.com Mon Nov 29 12:06:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 15:06:31 -0500 Subject: [openib-general] MAD registration for newer vendor classes In-Reply-To: <41AB7D29.409@ichips.intel.com> References: <194470-220041142517294790@M2W103.mail2web.com> <41AB6FB3.4040301@ichips.intel.com> <1101754775.4145.239.camel@localhost.localdomain> <41AB7642.2050803@ichips.intel.com> <1101756206.4145.265.camel@localhost.localdomain> <41AB7D29.409@ichips.intel.com> Message-ID: <1101758791.4145.268.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:48, Sean Hefty wrote: > I think that it makes sense, but I'd wait until we actually have code > that utilizes it.
An initial proposal for diagnostics will be posted in the next day or so. In it, there is an ibping utility. It is currently defined as having two ways of running it: with vendor MADs and with normal UD transport. That is the first use (if it meets with consensus).
-- Hal
From halr at voltaire.com Mon Nov 29 12:45:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 29 Nov 2004 15:45:08 -0500 Subject: [openib-general] Re: [PATCH] [re-send] cleanup/fixes for handle_outgoing_smp In-Reply-To: <20041129114054.2afd03b4.mshefty@ichips.intel.com> References: <20041129114054.2afd03b4.mshefty@ichips.intel.com> Message-ID: <1101761108.4145.274.camel@localhost.localdomain>
On Mon, 2004-11-29 at 14:40, Sean Hefty wrote: > - if (mad_agent_priv->agent.send_handler) { > - /* Now, complete send */ > - mad_send_wc.status = IB_WC_SUCCESS; > - mad_send_wc.vendor_err = 0; > - mad_send_wc.wr_id = send_wr->wr_id; > - mad_agent_priv->agent.send_handler( > - mad_agent, > - &mad_send_wc); > + /* Complete send */ > + mad_send_wc.status = IB_WC_SUCCESS; > + mad_send_wc.vendor_err = 0; > + mad_send_wc.wr_id = send_wr->wr_id; > + mad_agent->send_handler(mad_agent, &mad_send_wc); > + ret = 1;
Currently, it isn't safe to eliminate the check for the send_handler. (The registration code does not guarantee that a send_handler was supplied; it only does so in the case where no registration request is supplied with the registration). Should a send handler always be required or should this check be added back in ? -- Hal
From mshefty at ichips.intel.com Mon Nov 29 13:24:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Nov 2004 13:24:31 -0800 Subject: [openib-general] Re: [PATCH] [re-send] cleanup/fixes for handle_outgoing_smp In-Reply-To: <1101761108.4145.274.camel@localhost.localdomain> References: <20041129114054.2afd03b4.mshefty@ichips.intel.com> <1101761108.4145.274.camel@localhost.localdomain> Message-ID: <41AB938F.6020708@ichips.intel.com>
Hal Rosenstock wrote: > On Mon, 2004-11-29 at 14:40, Sean Hefty wrote: > > >>- if (mad_agent_priv->agent.send_handler) { >>- /* Now, complete send */ >>- mad_send_wc.status = IB_WC_SUCCESS; >>- mad_send_wc.vendor_err = 0; >>- mad_send_wc.wr_id = send_wr->wr_id; >>- mad_agent_priv->agent.send_handler( >>- mad_agent, >>- &mad_send_wc); > > >>+ /* Complete send */ >>+ mad_send_wc.status = IB_WC_SUCCESS; >>+ mad_send_wc.vendor_err = 0; >>+ mad_send_wc.wr_id = send_wr->wr_id; >>+ mad_agent->send_handler(mad_agent, &mad_send_wc); >>+ ret = 1; > > > Currently, it isn't safe to eliminate the check for the send_handler. > (The registration code does not guarantee that a send_handler was > supplied; it only does so in the case where no registration request is > supplied with the registration). > > Should a send handler always be required or should this check be added > back in ?
The send_handler is checked in ib_post_send_mad.
From gdror at mellanox.co.il Tue Nov 30 00:59:24 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Tue, 30 Nov 2004 10:59:24 +0200 Subject: [openib-general] Unicast ARP Message-ID: <506C3D7B14CDD411A52C00025558DED606933619@mtlex01.yok.mtl.com>
> -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, November 29, 2004 9:04 PM > > > Michael> Currently it also seems that by just bringing the > Michael> interface up and down the hw address will change. Unless > Michael> I am doing something wrong, this inconvenience seems to > Michael> be caused by a different QP number being assigned. Can't > Michael> this be solved e.g. by assigning a fixed QP number for IP > Michael> over IB? > > It seems you are doing something wrong.
The QP is allocated > when the interface is created and remains the same when the > interface is brought up and down: > > # ip addr show dev ib0 > 7: ib0: mtu 2044 qdisc > pfifo_fast qlen 128 > link/[32] > 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 > brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff > # ifconfig ib0 down > # ifconfig ib0 up > # ip addr show dev ib0 > 7: ib0: mtu 2044 qdisc > pfifo_fast qlen 128 > link/[32] > 00:02:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:01:82:06 > brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff >
Roland, You're right, it takes more than "ifconfig down" for the QPN to change. If you take the module down after doing ifconfig down, then the QPN may change. Assigning a specific QPN for ipoib requires allocation of QPN space, which is beyond the IB spec verbs. Current verbs do not allow it. I don't have any objection to that, except that you have to hold a set of preallocated QPs with specific numbers and hand them over to a privileged consumer when requested to. I wouldn't commit that it will work on any HCA architecture. -Dror
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From mst at mellanox.co.il Tue Nov 30 01:05:05 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Nov 2004 11:05:05 +0200 Subject: [openib-general] Unicast ARP In-Reply-To: <1101755782.4145.256.camel@localhost.localdomain> References: <91DB792C7985D411BEC300B40080D29C711B96@mtvex01.mtv.mtl.com> <1101749619.4145.214.camel@localhost.localdomain> <52653ottwn.fsf@topspin.com> <20041129185739.GA3394@mellanox.co.il> <52oehgsci9.fsf@topspin.com> <1101755782.4145.256.camel@localhost.localdomain> Message-ID: <20041130090505.GA11212@mellanox.co.il>
Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] Unicast ARP": > On Mon, 2004-11-29 at 14:04, Roland Dreier wrote: > > Michael> Currently it also seems that by just bringing the > > Michael> interface up and down the hw address will change. Unless > > Michael> I am doing something wrong, this inconvenience seems to > > Michael> be caused by a different QP number being assigned. Can't > > Michael> this be solved e.g. by assigning a fixed QP number for IP > > Michael> over IB? > > > > It seems you are doing something wrong. The QP is allocated when the > > interface is created and remains the same when the interface is > > brought up and down: > > The only time I see that is when removing and re-adding the IPoIB module.
That was it. My script was unloading the module. Thanks. MST
From halr at voltaire.com Tue Nov 30 08:34:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 30 Nov 2004 11:34:11 -0500 Subject: [openib-general] [RFC] Proposed OpenIB Diagnostic Tools Message-ID: <1101832451.6411.89.camel@localhost.localdomain>
Hi, Attached is an initial proposal for diagnostic tools. They fall into two categories: host and network. Applications (or scripts) and library support would be supplied. This is an initial writeup on the high level descriptions of the tools and an initial syntax. All comments welcome. BTW, are there coding guidelines for user space ? Note that the implementation of these tools will take a back seat to getting the OpenSM up and running with gen2.
At some point soon, I will put a copy of this in the gen2 tree perhaps at /gen2/trunk/src/userspace/diags -- Hal -------------- next part -------------- Diagnostic Tools 11/29/04 user space applications (also library support) two categories: host and network Host Oriented Diagnostic Tools 1. ibstatus Description: ibstatus displays basic information obtained from the local IB driver. -v enables verbose mode. Normal output includes LID, SMLID, port state, link width active, and port physical state. Verbose includes all sysfs supported parameters for that interface and port. Syntax: ibstatus [-v] [-I mthca0] [-p port] Dependencies: sysfs support in mthca 2. ibroute Description: ibroute uses SMPs to display the forwarding tables (unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT)) for the specified LID. Syntax: ibroute [-multi] [-m mkey] [-pa path] [-I mthca0] [-p port] LID Dependencies: user MAD access, SMA 3. ibtracert Description: ibtracert uses SMPs to trace the path from a source GID/LID to a destination GID/LID. The source GID/LID must be local to the node. Each hop along the path is displayed until the destination is reached or a hop does not respond. By using -mg and/or -ml options, multicast path tracing can be performed between source and destination nodes. Syntax: ibtracert [-m mkey] [-pa path] [-sg SGID] [-sl SLID] [-dg DGID] [-dl DLID] \ [-mg MGID] [-ml MLID] [-I mthca0] [-p port] Dependencies: user MAD access, SMA 4. smpquery Description: smpquery allows a basic subset of standard SMP queries including the following: local information (LID, GID, etc.), node information (from NodeDescription, NodeInfo, and possibly SwitchInfo if node is a switch), port information (port address and state), and port parameters (SLtoVLMappingTable, VLArbitrationTable, HOQLife, ...). Syntax: smpquery [-m mkey] [-l LID] [-pa path] [-I mthca0] [-p port] \ [-l] [-n] [-pi] [-pp] Dependencies: User MAD access 5. smpdump Description: smpdump is a general purpose SMP utility which gets SM attributes from a specified SMA. The result is dumped as hex (-x) or string (-s), with hex as the default. Syntax: smpdump [-m mkey] [-l LID] [-p path] [-I mthca0] [-p port] \ [-a attributeID] [-am attributeModifier] [-s] [-x] Dependencies: User MAD access 6. perfquery Description: perfquery uses PerfMgt GMPs to obtain the PortCounters (basic performance and error counters) from the PMA at the node specified. -r resets these counters after obtaining them. Syntax: perfquery [-I mthca0] [-p port] [-r] [-g GID] LID Dependencies: User MAD access, PMA 7. ibping Description: ibping uses UD transport to validate connectivity between IB nodes. It is run as client/server (daemon). -v option uses vendor MADs rather than normal UD transport. Syntax: ibping [-d] [-v] [-c count] [-i interval] [-s packetsize] \ [-I mthca0] [-p port] [-q qkey] [-g DGID] [-qp dqp] [-dl DLID] -d: run as daemon (server) Dependencies: user MAD access Network Oriented Diagnostics 8. ibnetdiscover Description: ibnetdiscover performs IB subnet discovery and outputs a human readable topology file. GUIDs, node types, and port numbers are displayed as well as port LIDs and NodeDescriptions. All nodes (and links) are displayed (full topology). 
Syntax: ibnetdiscover [-I mthca0] [-p port] [-o topology-filename] Dependencies: user MAD access Future versions of this file will be annotated with additional information including system guid, system type, internal to physical mapping, and physical location information (blade or ASIC number, etc.).
9. ibhosts Description: ibhosts either walks the IB subnet topology or uses an already saved topology file and extracts the HCA nodes. Syntax: ibhosts [-I mthca0] [-p port] [-i topology-filename] [-o ibhosts-filename] Dependencies: user MAD access, ibnetdiscover
10. ibswitches Description: ibswitches either walks the IB subnet topology or uses an already saved topology file and extracts the IB switches. Syntax: ibswitches [-I mthca0] [-p port] \ [-i topology-filename] [-o ibswitches-filename] Dependencies: user MAD access, ibnetdiscover
11. ibnetverify Description: ibnetverify uses a full topology file that was created by ibnetdiscover, scans the network to see whether the current topology matches, displaying any discrepancies, and validates the connectivity and reports errors (from port counters). Syntax: ibnetverify -f filename [-I mthca0] [-p port] Dependencies: user MAD access, ibnetdiscover
From philippe.gregoire at cea.fr Tue Nov 30 09:18:19 2004 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Tue, 30 Nov 2004 18:18:19 +0100 Subject: [openib-general] testing OpenIB on TopSpin hardware ? Message-ID: <200411301718.SAA08841@styx.bruyeres.cea.fr>
Hello, I would like to test the latest OpenIB software on our test platform, especially the SDP part. We have an HP-DL380/DL360 (IA32) cluster with 12 nodes connected through TopSpin HCAs and a TopSpin 90 IB switch. I got the OpenIB source with svn. What is the latest version, gen1 or 1.0? 1.0 looks like the version available in March, correct? What is the firmware requirement for the HCA and the switch ? Thanks for your help Philippe Gregoire CEA/DAM
From halr at voltaire.com Tue Nov 30 09:55:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 30 Nov 2004 12:55:31 -0500 Subject: [openib-general] [RFC] Diagnostic Tools Proposal Message-ID: <1101837331.6411.266.camel@localhost.localdomain>
is now located in the tree as: https://openib.org/svn/gen2/trunk/src/userspace/diags/diagtools-proposal.txt
From halr at voltaire.com Tue Nov 30 10:15:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 30 Nov 2004 13:15:48 -0500 Subject: [openib-general] smpdump and current MAD layer Message-ID: <1101838548.6411.276.camel@localhost.localdomain>
Hi, I believe there is an issue with smpdump (or gmpdump) and just want to make sure I am not forgetting something as I am wont to do :-) Each received MAD can only have 1 client which "owns" it. That client is either determined via solicited routing or version/class/method (and soon OUI) routing. So solicited MAD responses cannot currently be snooped, nor can unsolicited ones for which an agent is registered (since SMA and PMA are currently firmware-based, the latter is not an issue for the current implementation). Is the above correct ? If so, do you see a "clean" way around this ? Thanks. -- Hal
From mshefty at ichips.intel.com Tue Nov 30 10:40:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Nov 2004 10:40:30 -0800 Subject: [openib-general] [PATCH] added documentation for exported functions Message-ID: <20041130104030.274312d0.mshefty@ichips.intel.com>
Patch adds documentation for exported functions that did not have it in ib_verbs.h and device.c.
From philippe.gregoire at cea.fr Tue Nov 30 09:18:19 2004
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Tue, 30 Nov 2004 18:18:19 +0100
Subject: [openib-general] testing OpenIB on TopSpin hardware ?
Message-ID: <200411301718.SAA08841@styx.bruyeres.cea.fr>

Hello,

I would like to test the latest OpenIB software on our test platform,
especially the SDP part. We have an HP DL380/DL360 (IA32) cluster with 12
nodes connected through TopSpin HCAs and a TopSpin 90 IB switch.

I got the OpenIB source with svn. What is the latest version, gen1 or 1.0 ?
1.0 looks like the version available in March, correct ?

What are the firmware requirements for the HCA and the switch ?

Thanks for your help,
Philippe Gregoire
CEA/DAM

From halr at voltaire.com Tue Nov 30 09:55:31 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 12:55:31 -0500
Subject: [openib-general] [RFC] Diagnostic Tools Proposal
Message-ID: <1101837331.6411.266.camel@localhost.localdomain>

The diagnostic tools proposal is now located in the tree as:
https://openib.org/svn/gen2/trunk/src/userspace/diags/diagtools-proposal.txt

From halr at voltaire.com Tue Nov 30 10:15:48 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 13:15:48 -0500
Subject: [openib-general] smpdump and current MAD layer
Message-ID: <1101838548.6411.276.camel@localhost.localdomain>

Hi,

I believe there is an issue with smpdump (or gmpdump) and just want to make
sure I am not forgetting something, as I am wont to do :-)

Each received MAD can only have one client which "owns" it. That client is
either determined via solicited routing or version/class/method (and soon
OUI) routing. So solicited MAD responses cannot currently be snooped, nor
can unsolicited ones for which an agent is registered (since SMA and PMA
are currently firmware based, the latter is not an issue for the current
implementation).

Is the above correct ? If so, do you see a "clean" way around this ?

Thanks.

-- Hal

From mshefty at ichips.intel.com Tue Nov 30 10:40:30 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 10:40:30 -0800
Subject: [openib-general] [PATCH] added documentation for exported functions
Message-ID: <20041130104030.274312d0.mshefty@ichips.intel.com>

This patch adds documentation for the exported functions that did not have
it in ib_verbs.h and device.c, and fixes a slight formatting issue in the
ib_mad.h documentation. The patch will be committed shortly after sending
this.

- Sean

Index: include/ib_verbs.h
===================================================================
--- include/ib_verbs.h	(revision 1302)
+++ include/ib_verbs.h	(working copy)
@@ -849,28 +849,107 @@
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify);
 
+/**
+ * ib_alloc_pd - Allocates an unused protection domain.
+ * @device: The device on which to allocate the protection domain.
+ *
+ * A protection domain object provides an association between QPs, shared
+ * receive queues, address handles, memory regions, and memory windows.
+ */
 struct ib_pd *ib_alloc_pd(struct ib_device *device);
+
+/**
+ * ib_dealloc_pd - Deallocates a protection domain.
+ * @pd: The protection domain to deallocate.
+ */
 int ib_dealloc_pd(struct ib_pd *pd);
 
+/**
+ * ib_create_ah - Creates an address handle for the given address vector.
+ * @pd: The protection domain associated with the address handle.
+ * @ah_attr: The attributes of the address vector.
+ *
+ * The address handle is used to reference a local or global destination
+ * in all UD QP post sends.
+ */
 struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_modify_ah - Modifies the address vector associated with an address
+ * handle.
+ * @ah: The address handle to modify.
+ * @ah_attr: The new address vector attributes to associate with the
+ * address handle.
+ */
int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_query_ah - Queries the address vector associated with an address
+ * handle.
+ * @ah: The address handle to query.
+ * @ah_attr: The address vector attributes associated with the address
+ * handle.
+ */
 int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_destroy_ah - Destroys an address handle.
+ * @ah: The address handle to destroy.
+ */
 int ib_destroy_ah(struct ib_ah *ah);
 
+/**
+ * ib_create_qp - Creates a QP associated with the specified protection
+ * domain.
+ * @pd: The protection domain associated with the QP.
+ * @qp_init_attr: A list of initial attributes required to create the QP.
+ */
 struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr);
 
+/**
+ * ib_modify_qp - Modifies the attributes for the specified QP and then
+ * transitions the QP to the given state.
+ * @qp: The QP to modify.
+ * @qp_attr: On input, specifies the QP attributes to modify. On output,
+ * the current values of selected QP attributes are returned.
+ * @qp_attr_mask: A bit-mask used to specify which attributes of the QP
+ * are being modified.
+ */
 int ib_modify_qp(struct ib_qp *qp,
 		 struct ib_qp_attr *qp_attr,
 		 int qp_attr_mask);
 
+/**
+ * ib_query_qp - Returns the attribute list and current values for the
+ * specified QP.
+ * @qp: The QP to query.
+ * @qp_attr: The attributes of the specified QP.
+ * @qp_attr_mask: A bit-mask used to select specific attributes to query.
+ * @qp_init_attr: Additional attributes of the selected QP.
+ *
+ * The qp_attr_mask may be used to limit the query to gathering only the
+ * selected attributes.
+ */
 int ib_query_qp(struct ib_qp *qp,
 		struct ib_qp_attr *qp_attr,
 		int qp_attr_mask,
 		struct ib_qp_init_attr *qp_init_attr);
 
+/**
+ * ib_destroy_qp - Destroys the specified QP.
+ * @qp: The QP to destroy.
+ */
 int ib_destroy_qp(struct ib_qp *qp);
 
+/**
+ * ib_post_send - Posts a list of work requests to the send queue of
+ * the specified QP.
+ * @qp: The QP to post the work request on.
+ * @send_wr: A list of work requests to post on the send queue.
+ * @bad_send_wr: On an immediate failure, this parameter will reference
+ * the work request that failed to be posted on the QP.
+ */
 static inline int ib_post_send(struct ib_qp *qp,
 			       struct ib_send_wr *send_wr,
 			       struct ib_send_wr **bad_send_wr)
@@ -878,6 +957,14 @@
 	return qp->device->post_send(qp, send_wr, bad_send_wr);
 }
 
+/**
+ * ib_post_recv - Posts a list of work requests to the receive queue of
+ * the specified QP.
+ * @qp: The QP to post the work request on.
+ * @recv_wr: A list of work requests to post on the receive queue.
+ * @bad_recv_wr: On an immediate failure, this parameter will reference
+ * the work request that failed to be posted on the QP.
+ */
 static inline int ib_post_recv(struct ib_qp *qp,
 			       struct ib_recv_wr *recv_wr,
 			       struct ib_recv_wr **bad_recv_wr)
@@ -885,12 +972,37 @@
 	return qp->device->post_recv(qp, recv_wr, bad_recv_wr);
 }
 
+/**
+ * ib_create_cq - Creates a CQ on the specified device.
+ * @device: The device on which to create the CQ.
+ * @comp_handler: A user-specified callback that is invoked when a
+ * completion event occurs on the CQ.
+ * @event_handler: A user-specified callback that is invoked when an
+ * asynchronous event not associated with a completion occurs on the CQ.
+ * @cq_context: Context associated with the CQ returned to the user via
+ * the associated completion and event handlers.
+ * @cqe: The minimum size of the CQ.
+ *
+ * Users can examine the cq structure to determine the actual CQ size.
+ */
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
 			   void *cq_context, int cqe);
+
+/**
+ * ib_resize_cq - Modifies the capacity of the CQ.
+ * @cq: The CQ to resize.
+ * @cqe: The minimum size of the CQ.
+ *
+ * Users can examine the cq structure to determine the actual CQ size.
+ */
 int ib_resize_cq(struct ib_cq *cq, int cqe);
+
+/**
+ * ib_destroy_cq - Destroys the specified CQ.
+ * @cq: The CQ to destroy.
+ */
 int ib_destroy_cq(struct ib_cq *cq);
 
 /**
@@ -911,13 +1023,24 @@
 	return cq->device->poll_cq(cq, num_entries, wc);
 }
 
+/**
+ * ib_peek_cq - Returns the number of unreaped completions currently
+ * on the specified CQ.
+ * @cq: The CQ to peek.
+ * @wc_cnt: A minimum number of unreaped completions to check for.
+ *
+ * If the number of unreaped completions is greater than or equal to wc_cnt,
+ * this function returns wc_cnt, otherwise, it returns the actual number of
+ * unreaped completions.
+ */
 int ib_peek_cq(struct ib_cq *cq, int wc_cnt);
 
 /**
- * ib_req_notify_cq - request completion notification
- * @cq:the CQ to generate an event for
- * @cq_notify:%IB_CQ_SOLICITED for next solicited event,
- * %IB_CQ_NEXT_COMP for any completion.
+ * ib_req_notify_cq - Request completion notification on a CQ.
+ * @cq: The CQ to generate an event for.
+ * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will
+ * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP,
+ * notification will occur on the next completion.
  */
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
@@ -925,6 +1048,13 @@
 	return cq->device->req_notify_cq(cq, cq_notify);
 }
 
+/**
+ * ib_req_ncomp_notif - Request completion notification when there are
+ * at least the specified number of unreaped completions on the CQ.
+ * @cq: The CQ to generate an event for.
+ * @wc_cnt: The number of unreaped completions that should be on the
+ * CQ before an event is generated.
+ */
 static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
 {
 	return cq->device->req_ncomp_notif ?
@@ -932,14 +1062,52 @@
 		-ENOSYS;
 }
 
+/**
+ * ib_get_dma_mr - Returns a memory region for system memory that is
+ * usable for DMA.
+ * @pd: The protection domain associated with the memory region.
+ * @mr_access_flags: Specifies the memory access rights.
+ */
 struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
 
+/**
+ * ib_reg_phys_mr - Prepares a virtually addressed memory region for use
+ * by an HCA.
+ * @pd: The protection domain assigned to the registered region.
+ * @phys_buf_array: Specifies a list of physical buffers to use in the
+ * memory region.
+ * @num_phys_buf: Specifies the size of the phys_buf_array.
+ * @mr_access_flags: Specifies the memory access rights.
+ * @iova_start: The offset of the region's starting I/O virtual address.
+ */
 struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
 			     struct ib_phys_buf *phys_buf_array,
 			     int num_phys_buf,
 			     int mr_access_flags,
 			     u64 *iova_start);
 
+/**
+ * ib_rereg_phys_mr - Modifies the attributes of an existing memory region.
+ * Conceptually, this call performs the functions of deregister memory
+ * region followed by register physical memory region. Where possible,
+ * resources are reused instead of deallocated and reallocated.
+ * @mr: The memory region to modify.
+ * @mr_rereg_mask: A bit-mask used to indicate which of the following
+ * properties of the memory region are being modified.
+ * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies
+ * the new protection domain to associate with the memory region,
+ * otherwise, this parameter is ignored.
+ * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this
+ * field specifies a list of physical buffers to use in the new
+ * translation, otherwise, this parameter is ignored.
+ * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this
+ * field specifies the size of the phys_buf_array, otherwise, this
+ * parameter is ignored.
+ * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this
+ * field specifies the new memory access rights, otherwise, this
+ * parameter is ignored.
+ * @iova_start: The offset of the region's starting I/O virtual address.
+ */
 int ib_rereg_phys_mr(struct ib_mr *mr,
 		     int mr_rereg_mask,
 		     struct ib_pd *pd,
@@ -948,11 +1116,35 @@
 		     int mr_access_flags,
 		     u64 *iova_start);
 
+/**
+ * ib_query_mr - Retrieves information about a specific memory region.
+ * @mr: The memory region to retrieve information about.
+ * @mr_attr: The attributes of the specified memory region.
+ */
 int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr);
+
+/**
+ * ib_dereg_mr - Deregisters a memory region and removes it from the
+ * HCA translation table.
+ * @mr: The memory region to deregister.
+ */
 int ib_dereg_mr(struct ib_mr *mr);
 
+/**
+ * ib_alloc_mw - Allocates a memory window.
+ * @pd: The protection domain associated with the memory window.
+ */
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd);
 
+/**
+ * ib_bind_mw - Posts a work request to the send queue of the specified
+ * QP, which binds the memory window to the given address range and
+ * remote access attributes.
+ * @qp: QP to post the bind work request on.
+ * @mw: The memory window to bind.
+ * @mw_bind: Specifies information about the memory window, including
+ * its address range, remote access rights, and associated memory region.
+ */
 static inline int ib_bind_mw(struct ib_qp *qp,
 			     struct ib_mw *mw,
 			     struct ib_mw_bind *mw_bind)
@@ -963,12 +1155,32 @@
 		-ENOSYS;
 }
 
+/**
+ * ib_dealloc_mw - Deallocates a memory window.
+ * @mw: The memory window to deallocate.
+ */
 int ib_dealloc_mw(struct ib_mw *mw);
 
+/**
+ * ib_alloc_fmr - Allocates an unmapped fast memory region.
+ * @pd: The protection domain associated with the unmapped region.
+ * @mr_access_flags: Specifies the memory access rights.
+ * @fmr_attr: Attributes of the unmapped region.
+ *
+ * A fast memory region must be mapped before it can be used as part of
+ * a work request.
+ */
 struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd,
 			    int mr_access_flags,
 			    struct ib_fmr_attr *fmr_attr);
 
+/**
+ * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region.
+ * @fmr: The fast memory region to associate with the pages.
+ * @page_list: An array of physical pages to map to the fast memory region.
+ * @list_len: The number of pages in page_list.
+ * @iova: The I/O virtual address to use with the mapped region.
+ */
 static inline int ib_map_phys_fmr(struct ib_fmr *fmr,
 				  u64 *page_list, int list_len,
 				  u64 iova)
@@ -976,10 +1188,38 @@
 	return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova);
 }
 
+/**
+ * ib_unmap_fmr - Removes the mapping from a list of fast memory regions.
+ * @fmr_list: A linked list of fast memory regions to unmap.
+ */
 int ib_unmap_fmr(struct list_head *fmr_list);
+
+/**
+ * ib_dealloc_fmr - Deallocates a fast memory region.
+ * @fmr: The fast memory region to deallocate.
+ */
 int ib_dealloc_fmr(struct ib_fmr *fmr);
 
+/**
+ * ib_attach_mcast - Attaches the specified QP to a multicast group.
+ * @qp: QP to attach to the multicast group. The QP must be type
+ * IB_QPT_UD.
+ * @gid: Multicast group GID.
+ * @lid: Multicast group LID in host byte order.
+ *
+ * In order to send and receive multicast packets, subnet
+ * administration must have created the multicast group and configured
+ * the fabric appropriately. The port associated with the specified
+ * QP must also be a member of the multicast group.
+ */
 int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
+
+/**
+ * ib_detach_mcast - Detaches the specified QP from a multicast group.
+ * @qp: QP to detach from the multicast group.
+ * @gid: Multicast group GID.
+ * @lid: Multicast group LID in host byte order.
+ */
 int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 
 #endif /* IB_VERBS_H */

Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h	(revision 1302)
+++ include/ib_mad.h	(working copy)
@@ -110,16 +110,16 @@
 /**
  * ib_mad_send_handler - callback handler for a sent MAD.
- * @mad_agent - MAD agent that sent the MAD.
- * @mad_send_wc - Send work completion information on the sent MAD.
+ * @mad_agent: MAD agent that sent the MAD.
+ * @mad_send_wc: Send work completion information on the sent MAD.
  */
 typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent,
 				    struct ib_mad_send_wc *mad_send_wc);
 
 /**
  * ib_mad_recv_handler - callback handler for a received MAD.
- * @mad_agent - MAD agent requesting the received MAD.
- * @mad_recv_wc - Received work completion information on the received MAD.
+ * @mad_agent: MAD agent requesting the received MAD.
+ * @mad_recv_wc: Received work completion information on the received MAD.
  *
  * MADs received in response to a send request operation will be handed to
 * the user after the send operation completes. All data buffers given
@@ -130,15 +130,15 @@
 /**
  * ib_mad_agent - Used to track MAD registration with the access layer.
- * @device - Reference to device registration is on.
- * @qp - Reference to QP used for sending and receiving MADs.
- * @recv_handler - Callback handler for a received MAD.
- * @send_handler - Callback handler for a sent MAD.
- * @context - User-specified context associated with this registration.
- * @hi_tid - Access layer assigned transaction ID for this client.
+ * @device: Reference to device registration is on.
+ * @qp: Reference to QP used for sending and receiving MADs.
+ * @recv_handler: Callback handler for a received MAD.
+ * @send_handler: Callback handler for a sent MAD.
+ * @context: User-specified context associated with this registration.
+ * @hi_tid: Access layer assigned transaction ID for this client.
  * Unsolicited MADs sent by this client will have the upper 32-bits
  * of their TID set to this value.
- * @port_num - Port number on which QP is registered
+ * @port_num: Port number on which QP is registered
  */
 struct ib_mad_agent {
 	struct ib_device	*device;
@@ -152,9 +152,9 @@
 /**
  * ib_mad_send_wc - MAD send completion information.
- * @wr_id - Work request identifier associated with the send MAD request.
- * @status - Completion status.
- * @vendor_err - Optional vendor error information returned with a failed
+ * @wr_id: Work request identifier associated with the send MAD request.
+ * @status: Completion status.
+ * @vendor_err: Optional vendor error information returned with a failed
  * request.
  */
 struct ib_mad_send_wc {
@@ -165,11 +165,11 @@
 /**
  * ib_mad_recv_buf - received MAD buffer information.
- * @list - Reference to next data buffer for a received RMPP MAD.
- * @grh - References a data buffer containing the global route header.
+ * @list: Reference to next data buffer for a received RMPP MAD.
+ * @grh: References a data buffer containing the global route header.
  * The data referenced by this buffer is only valid if the GRH is
 * valid.
- * @mad - References the start of the received MAD.
+ * @mad: References the start of the received MAD.
  */
 struct ib_mad_recv_buf {
 	struct list_head	list;
@@ -179,9 +179,9 @@
 /**
  * ib_mad_recv_wc - received MAD information.
- * @wc - Completion information for the received data.
- * @recv_buf - Specifies the location of the received data buffer(s).
- * @mad_len - The length of the received MAD, without duplicated headers.
+ * @wc: Completion information for the received data.
+ * @recv_buf: Specifies the location of the received data buffer(s).
+ * @mad_len: The length of the received MAD, without duplicated headers.
  *
 * For a received response, the wr_id field of the wc is set to the wr_id
 * for the corresponding send request.
@@ -194,12 +194,12 @@
 /**
  * ib_mad_reg_req - MAD registration request
- * @mgmt_class - Indicates which management class of MADs should be receive
+ * @mgmt_class: Indicates which management class of MADs should be received
  * by the caller. This field is only required if the user wishes to
  * receive unsolicited MADs, otherwise it should be 0.
- * @mgmt_class_version - Indicates which version of MADs for the given
+ * @mgmt_class_version: Indicates which version of MADs for the given
  * management class to receive.
- * @method_mask - The caller will receive unsolicited MADs for any method
+ * @method_mask: The caller will receive unsolicited MADs for any method
 * where @method_mask = 1.
 */
 struct ib_mad_reg_req {
@@ -210,21 +210,21 @@
 /**
  * ib_register_mad_agent - Register to send/receive MADs.
- * @device - The device to register with.
- * @port_num - The port on the specified device to use.
- * @qp_type - Specifies which QP to access. Must be either
+ * @device: The device to register with.
+ * @port_num: The port on the specified device to use.
+ * @qp_type: Specifies which QP to access. Must be either
  * IB_QPT_SMI or IB_QPT_GSI.
- * @mad_reg_req - Specifies which unsolicited MADs should be received
+ * @mad_reg_req: Specifies which unsolicited MADs should be received
  * by the caller. This parameter may be NULL if the caller only
  * wishes to receive solicited responses.
- * @rmpp_version - If set, indicates that the client will send
+ * @rmpp_version: If set, indicates that the client will send
  * and receive MADs that contain the RMPP header for the given version.
  * If set to 0, indicates that RMPP is not used by this client.
- * @send_handler - The completion callback routine invoked after a send
+ * @send_handler: The completion callback routine invoked after a send
  * request has completed.
- * @recv_handler - The completion callback routine invoked for a received
+ * @recv_handler: The completion callback routine invoked for a received
  * MAD.
- * @context - User specified context associated with the registration.
+ * @context: User specified context associated with the registration.
  */
 struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 					   u8 port_num,
@@ -237,7 +237,7 @@
 /**
  * ib_unregister_mad_agent - Unregisters a client from using MAD services.
- * @mad_agent - Corresponding MAD registration request to deregister.
+ * @mad_agent: Corresponding MAD registration request to deregister.
  *
  * After invoking this routine, MAD services are no longer usable by the
  * client on the associated QP.
@@ -247,9 +247,9 @@
 /**
  * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated
  * with the registered client.
- * @mad_agent - Specifies the associated registration to post the send to.
- * @send_wr - Specifies the information needed to send the MAD(s).
- * @bad_send_wr - Specifies the MAD on which an error was encountered.
+ * @mad_agent: Specifies the associated registration to post the send to.
+ * @send_wr: Specifies the information needed to send the MAD(s).
+ * @bad_send_wr: Specifies the MAD on which an error was encountered.
  *
  * Sent MADs are not guaranteed to complete in the order that they were posted.
  */
@@ -259,8 +259,8 @@
 /**
  * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer.
- * @mad_recv_wc - Work completion information for a received MAD.
- * @buf - User-provided data buffer to receive the coalesced buffers. The
+ * @mad_recv_wc: Work completion information for a received MAD.
+ * @buf: User-provided data buffer to receive the coalesced buffers. The
  * referenced buffer should be at least the size of the mad_len specified
  * by @mad_recv_wc.
  *
@@ -273,7 +273,7 @@
 /**
  * ib_free_recv_mad - Returns data buffers used to receive a MAD to the
  * access layer.
- * @mad_recv_wc - Work completion information for a received MAD.
+ * @mad_recv_wc: Work completion information for a received MAD.
  *
  * Clients receiving MADs through their ib_mad_recv_handler must call this
  * routine to return the work completion buffers to the access layer.
@@ -282,8 +282,8 @@
 /**
  * ib_cancel_mad - Cancels an outstanding send MAD operation.
- * @mad_agent - Specifies the registration associated with sent MAD.
- * @wr_id - Indicates the work request identifier of the MAD to cancel.
+ * @mad_agent: Specifies the registration associated with sent MAD.
+ * @wr_id: Indicates the work request identifier of the MAD to cancel.
  *
  * MADs will be returned to the user through the corresponding
  * ib_mad_send_handler.
@@ -293,15 +293,15 @@
 /**
  * ib_redirect_mad_qp - Registers a QP for MAD services.
- * @qp - Reference to a QP that requires MAD services.
- * @rmpp_version - If set, indicates that the client will send
+ * @qp: Reference to a QP that requires MAD services.
+ * @rmpp_version: If set, indicates that the client will send
  * and receive MADs that contain the RMPP header for the given version.
  * If set to 0, indicates that RMPP is not used by this client.
- * @send_handler - The completion callback routine invoked after a send
+ * @send_handler: The completion callback routine invoked after a send
  * request has completed.
- * @recv_handler - The completion callback routine invoked for a received
+ * @recv_handler: The completion callback routine invoked for a received
  * MAD.
- * @context - User specified context associated with the registration.
+ * @context: User specified context associated with the registration.
  *
  * Use of this call allows clients to use MAD services, such as RMPP,
  * on user-owned QPs. After calling this routine, users may send
@@ -316,8 +316,8 @@
 /**
  * ib_process_mad_wc - Processes a work completion associated with a
  * MAD sent or received on a redirected QP.
- * @mad_agent - Specifies the registered MAD service using the redirected QP.
- * @wc - References a work completion associated with a sent or received
+ * @mad_agent: Specifies the registered MAD service using the redirected QP.
+ * @wc: References a work completion associated with a sent or received
  * MAD segment.
  *
  * This routine is used to complete or continue processing on a MAD request.

Index: core/device.c
===================================================================
--- core/device.c	(revision 1302)
+++ core/device.c	(working copy)
@@ -556,6 +556,17 @@
 }
 EXPORT_SYMBOL(ib_modify_device);
 
+/**
+ * ib_modify_port - Modifies the attributes for the specified port.
+ * @device: The device to modify.
+ * @port_num: The number of the port to modify.
+ * @port_modify_mask: Mask used to specify which attributes of the port
+ * to change.
+ * @port_modify: New attribute values for the port.
+ *
+ * ib_modify_port() changes a port's attributes as specified by the
+ * @port_modify_mask and @port_modify structure.
+ */
 int ib_modify_port(struct ib_device *device,
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify)
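To make the call flow documented by this patch concrete, here is a minimal
consumer sketch. It is not taken from the patch or the tree: it assumes the
usual in-kernel ERR_PTR error convention and the ib_qp_init_attr/ib_qp_cap
fields as documented above, with QP state transitions and work request
posting elided.

    #include <linux/err.h>
    #include <linux/string.h>
    #include <ib_verbs.h>

    /* Illustrative only: allocate a PD, a CQ, and a UD QP using the verbs
     * documented above, then tear everything down. */
    static int example_verbs_setup(struct ib_device *device)
    {
    	struct ib_qp_init_attr init_attr;
    	struct ib_pd *pd;
    	struct ib_cq *cq;
    	struct ib_qp *qp;
    	int ret = 0;

    	pd = ib_alloc_pd(device);
    	if (IS_ERR(pd))
    		return PTR_ERR(pd);

    	/* No handlers in this sketch; 32 is the minimum CQ size. */
    	cq = ib_create_cq(device, NULL, NULL, NULL, 32);
    	if (IS_ERR(cq)) {
    		ret = PTR_ERR(cq);
    		goto out_pd;
    	}

    	memset(&init_attr, 0, sizeof init_attr);
    	init_attr.send_cq          = cq;
    	init_attr.recv_cq          = cq;
    	init_attr.cap.max_send_wr  = 16;
    	init_attr.cap.max_recv_wr  = 16;
    	init_attr.cap.max_send_sge = 1;
    	init_attr.cap.max_recv_sge = 1;
    	init_attr.sq_sig_type      = IB_SIGNAL_ALL_WR;
    	init_attr.qp_type          = IB_QPT_UD;

    	qp = ib_create_qp(pd, &init_attr);
    	if (IS_ERR(qp)) {
    		ret = PTR_ERR(qp);
    		goto out_cq;
    	}

    	/* A real consumer would now transition the QP with ib_modify_qp()
    	 * and post work requests with ib_post_recv()/ib_post_send(). */

    	ib_destroy_qp(qp);
    out_cq:
    	ib_destroy_cq(cq);
    out_pd:
    	ib_dealloc_pd(pd);
    	return ret;
    }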
From mshefty at ichips.intel.com Tue Nov 30 10:55:47 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 10:55:47 -0800
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <1101838548.6411.276.camel@localhost.localdomain>
References: <1101838548.6411.276.camel@localhost.localdomain>
Message-ID: <41ACC233.2000201@ichips.intel.com>

Hal Rosenstock wrote:
> Each received MAD can only have one client which "owns" it. That client
> is either determined via solicited routing or version/class/method (and
> soon OUI) routing.

This is correct. This was done to avoid having to copy received MADs.

> So solicited MAD responses cannot currently be snooped, nor can
> unsolicited ones for which an agent is registered (since SMA and PMA
> are currently firmware based, the latter is not an issue for the
> current implementation).
>
> Is the above correct ? If so, do you see a "clean" way around this ?

This is something that was briefly discussed before. I think that I
would support snooping by extending the ib_mad_reg_req structure to
indicate a registration type, possibly along with some additional
filtering parameters. (We could also create a new snoop routine.)

One issue with snooping MADs is whether the snooping occurs above or
below RMPP, or possibly in both places.

- Sean

From halr at voltaire.com Tue Nov 30 11:33:19 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 14:33:19 -0500
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <41ACC233.2000201@ichips.intel.com>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
Message-ID: <1101843199.6411.288.camel@localhost.localdomain>

On Tue, 2004-11-30 at 13:55, Sean Hefty wrote:
> This is something that was briefly discussed before. I think that I
> would support snooping by extending the ib_mad_reg_req structure to
> indicate a registration type, possibly along with some additional
> filtering parameters. (We could also create a new snoop routine.)

Maybe a single bit field in the registration request to indicate snoop.

Another question is what granularity of snoop registration needs to be
supported. Is one SMP snooper and one GMP snooper sufficient ? Should
the snoopers be per class ? It seems to me that going down to the
method level is too much for snoopers. This is just another way of
expressing the filtering parameters you mention.

> One issue with snooping MADs is whether the snooping occurs above or
> below RMPP, or possibly in both places.

In general, I would think the GMP snooping would specify whether it is
to be done above or below RMPP, and perhaps the class or all GS classes
(some combinations wouldn't make sense). If one were having problems
with RMPP handling, I could see doing the snooping below RMPP;
otherwise, above (the normal case).

-- Hal
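To visualize the two options on the table at this point in the thread (a
snoop bit in the registration request, or a dedicated snoop call), here is a
purely hypothetical sketch; neither the field nor the routine below exists
in the tree, and the names are invented for illustration:

    /* Hypothetical only: a snoop bit added to a registration request,
     * per the "single bit field" idea above.  Renamed to _sketch to make
     * clear this is not the real ib_mad_reg_req. */
    struct ib_mad_reg_req_sketch {
    	u8	mgmt_class;
    	u8	mgmt_class_version;
    	/* method_mask bitmap as in the existing structure ... */
    	u8	snoop;	/* if set, receive copies of MADs owned by others */
    };

    /* ... or, as the alternative mentioned above, a dedicated snoop
     * registration call (also hypothetical): */
    struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
    					       u8 port_num,
    					       enum ib_qp_type qp_type,
    					       u8 rmpp_version,
    					       ib_mad_recv_handler recv_handler,
    					       void *context);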
From mshefty at ichips.intel.com Tue Nov 30 11:47:47 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 11:47:47 -0800
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <1101843199.6411.288.camel@localhost.localdomain>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
Message-ID: <41ACCE63.5050800@ichips.intel.com>

Hal Rosenstock wrote:
>> This is something that was briefly discussed before. I think that I
>> would support snooping by extending the ib_mad_reg_req structure to
>> indicate a registration type, possibly along with some additional
>> filtering parameters. (We could also create a new snoop routine.)
>
> Maybe a single bit field in the registration request to indicate snoop.
>
> Another question is what granularity of snoop registration needs to be
> supported. Is one SMP snooper and one GMP snooper sufficient ? Should
> the snoopers be per class ? It seems to me that going down to the
> method level is too much for snoopers. This is just another way of
> expressing the filtering parameters you mention.

I guess filtering can be done above the MAD layer, so just letting the
user specify the qp_type may be all that's needed, beyond indicating
that snooping is desired. If we go this route, we can probably support
any number of snoopers.

>> One issue with snooping MADs is whether the snooping occurs above or
>> below RMPP, or possibly in both places.
>
> In general, I would think the GMP snooping would specify whether it is
> to be done above or below RMPP, and perhaps the class or all GS classes
> (some combinations wouldn't make sense). If one were having problems
> with RMPP handling, I could see doing the snooping below RMPP;
> otherwise, above (the normal case).

Hmm... we could let the client decide through the rmpp_version
parameter. Also, would snooping include redirected QPs? I think that
we can support this.

- Sean

From halr at voltaire.com Tue Nov 30 12:24:59 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 15:24:59 -0500
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <41ACCE63.5050800@ichips.intel.com>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
	<41ACCE63.5050800@ichips.intel.com>
Message-ID: <1101846298.6411.351.camel@localhost.localdomain>

On Tue, 2004-11-30 at 14:47, Sean Hefty wrote:
> I guess filtering can be done above the MAD layer,

That seems like the right way to go to me.

> so just letting the
> user specify the qp_type may be all that's needed, beyond indicating
> that snooping is desired. If we go this route, we can probably support
> any number of snoopers.

Then there are really only 2 snoopers at the MAD level (SMI, GSI). Any
additional demux (snoopers) would be done above the MAD level.

> >> One issue with snooping MADs is whether the snooping occurs above or
> >> below RMPP, or possibly in both places.
> >
> > In general, I would think the GMP snooping would specify whether it is
> > to be done above or below RMPP, and perhaps the class or all GS classes
> > (some combinations wouldn't make sense). If one were having problems
> > with RMPP handling, I could see doing the snooping below RMPP;
> > otherwise, above (the normal case).
>
> Hmm... we could let the client decide through the rmpp_version
> parameter.

I like it. Nothing new is needed here. It's just a question of when to
implement it, above and below RMPP.

> Also, would snooping include redirected QPs? I think that
> we can support this.

Once a QP is redirected, is the MAD layer still involved in handing off
the receive completions for that QP ?

-- Hal

From halr at voltaire.com Tue Nov 30 14:46:28 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 30 Nov 2004 17:46:28 -0500
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <41ACCE63.5050800@ichips.intel.com>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
	<41ACCE63.5050800@ichips.intel.com>
Message-ID: <1101854788.6411.373.camel@localhost.localdomain>

On Tue, 2004-11-30 at 14:47, Sean Hefty wrote:
> I guess filtering can be done above the MAD layer, so just letting the
> user specify the qp_type may be all that's needed, beyond indicating
> that snooping is desired. If we go this route, we can probably support
> any number of snoopers.

Does that mean the snoopers would just be a list based on qp_type (and
we have a list per QP type (SMI, GSI)) ?
-- Hal

From mshefty at ichips.intel.com Tue Nov 30 14:57:10 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 30 Nov 2004 14:57:10 -0800
Subject: [openib-general] smpdump and current MAD layer
In-Reply-To: <1101846298.6411.351.camel@localhost.localdomain>
References: <1101838548.6411.276.camel@localhost.localdomain>
	<41ACC233.2000201@ichips.intel.com>
	<1101843199.6411.288.camel@localhost.localdomain>
	<41ACCE63.5050800@ichips.intel.com>
	<1101846298.6411.351.camel@localhost.localdomain>
Message-ID: <41ACFAC6.1060804@ichips.intel.com>

Hal Rosenstock wrote:
>> I guess filtering can be done above the MAD layer,
>
> That seems like the right way to go to me.

Same here.

> Then there are really only 2 snoopers at the MAD level (SMI, GSI). Any
> additional demux (snoopers) would be done above the MAD level.

I was referring to allowing multiple clients to snoop QP0/1 traffic.
To implement this, it seems that we'd only need a single list per QP
per port.

>> Also, would snooping include redirected QPs? I think that
>> we can support this.
>
> Once a QP is redirected, is the MAD layer still involved in handing off
> the receive completions for that QP ?

Currently, the API expects the user to call ib_process_mad_wc for
*some* send or receive completions: those associated with RMPP and with
requests/responses. We can state that users of redirected QPs should
always call ib_process_mad_wc for any MAD related work completion, but
that isn't strictly enforceable as long as the user controls the CQ.
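As a rough sketch of the convention Sean describes for user-owned
(redirected) QPs, handing every MAD-related completion back through
ib_process_mad_wc, and assuming the agent/wc signature implied by the
documentation patch earlier in this thread:

    #include <ib_verbs.h>
    #include <ib_mad.h>

    /* Illustrative only: drain a CQ attached to a redirected QP, handing
     * each completion to the MAD layer as suggested above.  The agent is
     * assumed to come from ib_redirect_mad_qp(). */
    static void drain_redirected_cq(struct ib_mad_agent *agent,
    				    struct ib_cq *cq)
    {
    	struct ib_wc wc;

    	while (ib_poll_cq(cq, 1, &wc) == 1) {
    		/* Always let the MAD layer see the completion so RMPP and
    		 * request/response tracking keep working; as noted, this
    		 * cannot be enforced while the user owns the CQ. */
    		ib_process_mad_wc(agent, &wc);
    	}
    }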