From yaronh at voltaire.com Tue Mar 1 02:29:46 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Tue, 1 Mar 2005 12:29:46 +0200
Subject: [openib-general] IB Address Translation service
Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FACB5@taurus.voltaire.com>

Eric, let me correct some of your assumptions, which this API is
actually meant to protect against; see below.

> -----Original Message-----
> From: Eric W. Biederman [mailto:eric at lnxi.com] On Behalf Of Eric W.
> Biederman
> Sent: Tuesday, March 01, 2005 9:18 AM
> To: Yaron Haviv
> Cc: Roland Dreier; shaharf; openib-general at openib.org
> Subject: Re: [openib-general] IB Address Translation service
>
> "Yaron Haviv" writes:
>
> > > -----Original Message-----
> > > From: openib-general-bounces at openib.org [mailto:openib-general-
> > > bounces at openib.org] On Behalf Of Roland Dreier
> > > Sent: Monday, February 28, 2005 7:13 PM
> > > To: shaharf
> > > Cc: openib-general at openib.org
> > > Subject: Re: [openib-general] IB Address Translation service
> > >
> > > This API seems overly complex and at the same time too inflexible to
> > > me. However, rather than getting bogged down nitpicking about APIs, I
> > > think we have to take a few steps back.
> >
> > I believe the API is very flexible, but we are pretty open to here what
> > you think is needed in addition
> >
> > > First, let's understand the problem we're trying to solve. Who are
> > > the consumers of this address translation service?
> >
> > The first problem is that most ULPs use valid IP addresses for
> > simplicity (DAPL, iSER, NFS/RDMA, SDP, MPI, etc') and someone needs to
> > resolve it to an IB address and device to use IB. This should take into
> > account cases where there are more than one HCAs in the system.
> > Preferable/optionally the ULP would like to know which partition to use
> > if there is more than one, and leverage on the IP subnetting done by
> > IPoIB.
>
> I am confused.
> In any sane network the translation is:
> Hostname -> address.
>
> IP because it spans multiple networks does:
> Hostname -> IP address -> hw address.
>
> IB because it can span multiple IB networks does:
> GUID+QPN -> LID + QPN.
>
> So what is wrong with simply doing:
> Hostname -> GUID
> ???

1. In standard protocols such as SDP, iSER, NFS/RDMA, Oracle, ... (unlike
OSU MPICH) the name service is one of the standard IP name services
mapping host names to IP addresses, and the ULP accepts a destination IP
and NOT a host name.

2. The InfiniBand hardware address is a GID and not a LID. The LID is a
path attribute, implemented to avoid the slow 48-bit lookup done in
Ethernet and to enable multi-pathing. A LID is dynamically allocated, and
a port may have multiple LIDs. (The OSU MPICH implementation is a bad
example of IB citizenship.)

So to summarize:

Ethernet:   Host Name -> IP -> MAC Address
InfiniBand: Host Name -> IP -> GID Address -> Path (LID, SL, ..)

So if we intend to rely on standard name services we can start with IP
(or implement a proprietary name service for Name->HW Addr if we wish).

Then we need to translate an IP to a HW address (GID/GUID) and the
equivalent of VLANs (partitions). This is provided by the
ib_at_route_by_ip call. Internally it is based on IP and IPoIB mechanisms,
similar to how Libor implemented it in SDP (and optionally, if we see a
need, using ATS).

Then in IB we need to resolve a GID to path attributes, which consist of
LID, SL/VL, MTU, etc. The inputs are the source, destination, partition
and QoS attributes, and the result is a path. Since IB also supports
multi-pathing, a user may receive multiple paths that can be used for
high availability, performance aggregation, or source-based routing. A
path may also travel through isolated congestion domains using VLs.

The ib_at_paths_by_route call allows resolving a HW address plus
preferences to one or more path records that are then used by the ULP and
the CM.
The ib_at_paths_by_route call can also be used by non-IP based ULPs such
as SRP or MPICH. That is why the API, unlike the current SDP
implementation, is divided into two calls: one for the HW address and one
for the path.

Currently OSU MPICH uses a proprietary name service and LID+QP assignment.
It doesn't work the standard IB way with SA & CM, so it does not make use
of a lot of IB capabilities and is more static and less robust. I wouldn't
use it as the example for ULP implementation. The MPI layer, which has no
idea about the fabric routing/utilization/availability, is determining
the path.

Another simple scenario: your application needs to run MPI and NFS on
different IB VLs. Today you need to manually configure (recompile) that
in each ULP; with this proposal it can be done automatically with a
central configuration on the SM.

On the other hand, SDP uses the same mechanisms; however, we cannot use
its code for other ULPs (e.g. kDAPL), and it is also missing functionality
that is needed by many of our users. The proposal calls for one set of
calls for current and future ULPs.

> It would be brain damaged for DAPL to require IP addresses. Not that
> DAPL hasn't shown some brain damage already.

DAPL uses IP addresses since it is a common API for IB & Ethernet/RDMA.
I'm not sure what is wrong with IP; millions use it and are familiar with
it, which is something I can't say about GIDs & LIDs.

> You can't do GUID -> IP because there is not a requirement on
> a 1 to 1 mapping. And in general there is no fixed IP -> GUID mapping.

If you dig into the call, it returns an array of IPs, and you can also
specify a VLAN (P_Key).

>
> What are the semantics in the upper levels when the IP -> GUID mapping
> changes? Does you connection properly follow the IP to the new GUID?
>

That's a ULP implementation question; I believe in general it shouldn't.

> Just FYI IPv6 doesn't use arp.

The implementation will depend on the IP stack to provide the IP->GID
mapping, so it supports both IPv4 & IPv6.
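The two-stage flow described in this message (IP -> GID plus partition, then GID -> one or more paths) can be sketched roughly as below. This is only an illustration under stated assumptions: the thread does not give the signatures of ib_at_route_by_ip or ib_at_paths_by_route, so every type, function name (demo_*), table and value here is hypothetical, and the static tables merely stand in for what the IP stack and the SA would supply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; not the proposed ib_at structures. */
struct demo_gid   { uint8_t raw[16]; };
struct demo_route { struct demo_gid dgid; uint16_t pkey; };    /* stage 1 result */
struct demo_path  { uint16_t dlid; uint8_t sl; uint8_t mtu; }; /* stage 2 result */

#define DEMO_GID_A { { 0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0, 0x02, 0xc9, 0, 0, 0, 0, 0x01 } }
#define DEMO_GID_B { { 0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0, 0x02, 0xc9, 0, 0, 0, 0, 0x02 } }

/* Made-up stand-in for what the IP stack / IPoIB would know. */
static const struct {
    uint32_t ip;                /* IPv4 address, host byte order */
    struct demo_route route;
} demo_ip_table[] = {
    { 0x0a000002u, { DEMO_GID_A, 0xffff } },    /* 10.0.0.2 */
    { 0x0a000003u, { DEMO_GID_B, 0xffff } },    /* 10.0.0.3 */
};

/* Made-up stand-in for the path records an SA query would return. */
static const struct {
    struct demo_gid dgid;
    uint16_t pkey;
    struct demo_path path;
} demo_path_table[] = {
    { DEMO_GID_A, 0xffff, { 0x12, 0, 4 } },
    { DEMO_GID_A, 0xffff, { 0x13, 1, 4 } },     /* second path, same port */
    { DEMO_GID_B, 0xffff, { 0x20, 0, 4 } },
};

/* Stage 1: IP -> HW address (GID) plus partition. Returns 0, or -1 if unknown. */
int demo_route_by_ip(uint32_t ip, struct demo_route *out)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < sizeof(demo_ip_table) / sizeof(demo_ip_table[0]); i++) {
        if (demo_ip_table[i].ip == ip) {
            *out = demo_ip_table[i].route;
            return 0;
        }
    }
    return -1;
}

/* Stage 2: GID + partition -> up to max path records. Returns the count. */
size_t demo_paths_by_route(const struct demo_route *rt,
                           struct demo_path *out, size_t max)
{
    size_t i, n = 0;

    assert(rt != NULL && out != NULL);
    for (i = 0; i < sizeof(demo_path_table) / sizeof(demo_path_table[0]); i++) {
        if (n == max)
            break;
        if (demo_path_table[i].pkey == rt->pkey &&
            memcmp(demo_path_table[i].dgid.raw, rt->dgid.raw, 16) == 0)
            out[n++] = demo_path_table[i].path;
    }
    return n;
}
```

An IP-based ULP would call the first function and feed its result to the second; a non-IP ULP such as SRP could fill in a route itself and call only the second, which is the reason given above for splitting the API in two.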
Yaron

From shaharf at voltaire.com Tue Mar 1 06:33:46 2005
From: shaharf at voltaire.com (shaharf)
Date: Tue, 1 Mar 2005 16:33:46 +0200
Subject: [openib-general] IB Address Translation service
Message-ID:

>
> Roland> First, let's understand the problem we're trying to solve.
> Roland> Who are the consumers of this address translation service?
>
> shaharf> Any ULPs at user & kernel, and also some
> shaharf> applications.
>
> I think this is too general an answer. We should be designing based
> on specific ULPs and applications. For example, I don't see anything
> particularly useful to IPoIB in this API. Perhaps Libor can comment
> on how this API works for SDP.
>

You are right about IPoIB. I think that IPoIB should not use this API
(or at least the functions that may use ARP) because this creates a
circular dependency in the architecture. Of course this can be solved,
but I think that it is really unnecessary. IPoIB also has relatively
modest resolution requirements, and I don't see why we should complicate
things.

SDP, kDAPL and maybe others are a different story. As Libor already
mentioned in a different mail, SDP already does a very similar lookup.
In fact, one of my internal goals was to be able to fulfill the SDP
requirements. The internal resolution should be very similar to the
current SDP implementation, except that the ATS option is to be
supported.

The ATS issue is orthogonal to this API. As long as there are ULPs (such
as kDAPL) that require it (even just for reverse mapping), we should
provide it. My personal opinion is that the IB-ARP + ATS combination is
twisted. As Libor wrote, it brings up many issues regarding distributed
mechanisms vs. central mechanisms and databases. I guess that up to here
there is a consensus. But my (personal, not Voltaire's) take is that the
redundant mechanism is the ARP and not the ATS. My reasoning is simple:
IB management is centralized. I don't like it, but that's the way it is.
Adding contradicting mechanisms does not solve the problem, it just makes
everything more complex. As I understand it, the ARP reasoning is that,
because the resolving process has two stages (IP->GID, GID->LID), it is
reasonable to use a separate and well-known mechanism for IP to IB
resolution. Another point is that it is distributed and therefore doesn't
require the SM (at least when ignoring the multicast setup). I think that
as IP is tunneled over IB, it is not reasonable to use ARP, and its
distributed nature is a problem, not a feature - the SM is still required
for the path record and the multicast management. The correct solution
for the centralized IB management is to distribute the SM - not the
underlying mechanisms. I think that it is not too hard to distribute the
SM, or at least the SA part of it. The SM/SA can also cache the requests
much better than the clients. Furthermore, a unified ATS + path query can
be defined to resolve everything in one stage. This will simplify many
aspects of the resolution. But again, this is not really the main issue.

> What application would use functions like ib_at_ips_by_gid() or
> ib_at_ips_by_subnet()?
>
> shaharf> My take right now is to implement a kernel based
> shaharf> mechanism and a user mode library to interface it. There
> shaharf> are other feasible solutions. I would really like you
> shaharf> have your suggestions and preferences.
>
> Unless there is a real kernel consumer that needs something this
> elaborate, I would prefer to implement this sort of caching service as
> a userspace daemon/library. This allows for more sophisticated
> implementations (eg persistent caches) and also makes debugging and
> maintenance easier.
>

The ib_at_ips_by_gid() function is intended for reverse resolution, i.e.
if you have a GID and you want to resolve it back to an IP/device, and
ib_at_ips_by_subnet() lets you resolve all IB devices (and GIDs) on a
subnet, for example for application-level load balancing/failover.
ib_at_ips_by_gid() is required by kDAPL.

I totally agree that overengineering is bad. This means that some of the
functions (such as ib_at_ips_by_subnet) may be implemented at the first
stage only in usermode.

> shaharf> I think that starting with the APIs is a valid approach
> shaharf> that has its own advantages and disadvantages.
>
> Sure, it's always good to have code in hand to start a discussion.
> But in this case the API seems to be far ahead of its consumers, so it
> ends up feeling overengineered to me.
>

You are completely right. The proposed API is designed to cover the
(near) future requirements of iSER, NFS-RDMA, kDAPL, SDP, and others. It
attempts to cover the following issues:

    Resolution
    Back resolution
    Multi-pathing
    Fail-over
    TOS/QOS
    Partitioning

These are not visionary requirements. They are present or very near
future requirements. The API attempts to show the "correct" solutions for
some common problems. Without it, we may end up with several different
and mismatched solutions to the same problems. We don't want the ULPs to
reinvent the wheel every time.

The only "over-engineering" IMO is the caching support. I think that
caching is very likely to happen, so it is best for the API to let the
clients know: "beware, these functions may return cached results". Some
applications may care. Note that the caching impact is only a few flags
and an invalidate function. This is not a very big overhead.

As for the usermode/kernel mode issue, I would be happy to implement
everything in usermode. This leaves just a small issue of efficient
kernel-to-user request interfaces...
Personally, I think that it is a legitimate architecture (a user mode
daemon serving the kernel), especially when you keep the caches within
the kernel so the fast paths do not require usermode intervention, and
let the usermode daemons maintain the caches and do the slow path tasks,
where the extra context-switch overhead will be insignificant relative to
the entire slow path latency. I am not sure that my approach is very
popular...

> - R.

Shahar

From halr at voltaire.com Tue Mar 1 06:50:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Mar 2005 09:50:25 -0500
Subject: [openib-general] Question
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEEF1@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EEEF1@mtlex01.yok.mtl.com>
Message-ID: <1109688422.23094.1265.camel@localhost.localdomain>

Hi Eitan,

On Tue, 2005-03-01 at 01:26, Eitan Zahavi wrote:
> Hal wrote:
> > So this looks like a workaround for a bug. Not sure what any of the
> > other symptoms are but I'm real curious now. Can someone comment
> > more on this ?
>
> The ERR 3610 is really just a warning. It is caused by the Anafa1 chip
> responding with a LinearFDBTop 0xC000.

Are you sure that is the only problem when Anafa1 gets into this state?
Does it continue to forward LR packets? What happens with all the other
LIDs now theoretically in play?

I also presume there's no fix for this with Anafa1. Is that correct?

> OpenSM does know how to handle that case and fix it.

Right, the workaround resets LinearFDBTop.

> > At a minimum, the SMA is reporting an invalid value for
> > PortInfo::LinearFDBTop. I wonder if it also is incapable of
> > forwarding DR MADs as well. That would explain this.
> There are no issues with that switch ability to do DR mads.
-- Hal From jlentini at netapp.com Tue Mar 1 07:16:03 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 1 Mar 2005 10:16:03 -0500 (EST) Subject: [openib-general] MTHCA features In-Reply-To: <52bra4s8w1.fsf@topspin.com> References: <52bra4s8w1.fsf@topspin.com> Message-ID: roland> James> It is my understanding that the current MTHCA driver does roland> James> not support InfiniBand memory windows or memory roland> James> registration using virtual addresses. roland> roland> James> Is this information correct? If so, when will these roland> James> features be supported? roland> roland> Well, memory registration is pretty complete. By design, we only roland> support memory registration with physical addresses for kernel roland> consumers even at the verbs API level (ie there are no mthca-specific roland> limitations). In the kernel, registration by virtual address is not roland> very useful. For userspace verbs, only registration by virtual roland> address is supported for obvious reasons. roland> roland> Memory windows are not implemented for mthca. It wouldn't be a lot of roland> work for someone with access to Mellanox documentation to implement roland> them, but they're not particularly useful due to their performance roland> characteristics. Is anyone on this list working on memory window support? I ask because the DAT API contains interfaces that allow users to interact with memory windows. From eitan at mellanox.co.il Tue Mar 1 07:45:01 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 1 Mar 2005 17:45:01 +0200 Subject: [openib-general] Question Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEEFD@mtlex01.yok.mtl.com> The bug is only in the meaning of the report. No other issue was found with it. The Anafa1 will report this wrong value only after reboot. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, March 01, 2005 4:50 PM > To: Eitan Zahavi > Cc: Ronald G. Minnich; openib-general at openib.org > Subject: RE: [openib-general] Question > > Hi Eitan, > > On Tue, 2005-03-01 at 01:26, Eitan Zahavi wrote: > > Hal wrote: > > > So this looks like a workaround for a bug. Not sure what any of the > > other symptoms > > > are but I'm real curious now. Can someone comment more on this ? > > > > The ERR 3610 is really just a warning. It is caused by the Anafa1 chip > > responding with a LinearFDBTop 0xC000. > > Are you sure that the only problem when Anafa1 gets into this state ? > Does it continue to forward LR packets ? What happens with all the other > LIDs now theoretically in play ? > > I also presume there's no fix for this with Anafa 1. Is that correct ? > > > OpenSM does know how to handle that case and fix it. > > Right, the workaround resets LinearFDBTop. > > > > At a minimum, the SMA is reporting an invalid value for > > PortInfo::LinearFDBTop. I > > > wonder if it also is incapable of forwarding DR MADs as well. That > > would explain > > > this. > > There are no issues with that switch ability to do DR mads. > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Tue Mar 1 08:03:19 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 08:03:19 -0800 Subject: [openib-general] MTHCA features In-Reply-To: (James Lentini's message of "Tue, 1 Mar 2005 10:16:03 -0500 (EST)") References: <52bra4s8w1.fsf@topspin.com> Message-ID: <521xazqrp4.fsf@topspin.com> James> I ask because the DAT API contains interfaces that allow James> users to interact with memory windows. Are there any real applications that use those interfaces? 
Thanks, Roland From roland at topspin.com Tue Mar 1 08:03:47 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 08:03:47 -0800 Subject: [openib-general] MTHCA features In-Reply-To: (Or Gerlitz's message of "Tue, 1 Mar 2005 08:18:29 +0200") References: Message-ID: <52wtsrpd3w.fsf@topspin.com> Or> By "performance characteristics" do you mean the extra Or> overhead to generate another rkey for the already registered Or> address range (and also to create/free the mw)? No, I mean the performance cost of binding/unbinding the MW. - R. From roland at topspin.com Tue Mar 1 08:42:47 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 08:42:47 -0800 Subject: [openib-general] [PATCH] Add PCI device ID for new Mellanox HCA In-Reply-To: <52fyzfrk29.fsf@topspin.com> (Roland Dreier's message of "Mon, 28 Feb 2005 21:50:38 -0800") References: <52fyzfrk29.fsf@topspin.com> Message-ID: <52oee3pbaw.fsf@topspin.com> Hi Greg, It turns out that Mellanox decided to change the device ID at the last minute. So of course there will be parts with both IDs. Here's an updated patch that includes both IDs. Please use this instead. Thanks, Roland Add PCI device IDs for new Mellanox "Sinai" InfiniHost III Lx HCA. 
Signed-off-by: Roland Dreier

--- linux-svn.orig/include/linux/pci_ids.h 2005-02-28 21:10:53.000000000 -0800
+++ linux-svn/include/linux/pci_ids.h 2005-03-01 08:39:49.766178558 -0800
@@ -1992,6 +1992,8 @@
 #define PCI_DEVICE_ID_MELLANOX_TAVOR 0x5a44
 #define PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT 0x6278
 #define PCI_DEVICE_ID_MELLANOX_ARBEL 0x6282
+#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
+#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274
 
 #define PCI_VENDOR_ID_PDC 0x15e9
 #define PCI_DEVICE_ID_PDC_1841 0x1841

From mshefty at ichips.intel.com Tue Mar 1 09:11:55 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 01 Mar 2005 09:11:55 -0800
Subject: [openib-general] Re: [CM] destroy_cm_id
In-Reply-To: <20050228173307.K21494@topspin.com>
References: <20050228105036.122e3b1c.mshefty@ichips.intel.com> <20050228173307.K21494@topspin.com>
Message-ID: <4224A25B.4010500@ichips.intel.com>

Libor Michalek wrote:
> Is it ever allowed to call ib_destroy_cm_id() from a CM callback?
> For some reason I thought that this was OK from only the IDLE callback,
> but if I destroy from IDLE I get a hang on cm_id_priv->lock, I believe.
> Should the normal mode of operation in the case be to return an error
> from IDLE to ensure that cm_id gets cleaned-up?

You cannot call ib_destroy_cm_id from a callback. A reference is held on
the cm_id while the callback is in progress, so the call to
ib_destroy_cm_id will block forever. The solution is to return a non-zero
value from the callback itself, which will destroy the cm_id.

Note that you can destroy the cm_id at any time. You don't need to wait
for it to transition to IDLE. (The CM maintains the timewait state
separately from the cm_id itself.)
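The rule above can be illustrated with a small single-threaded userspace model. This is a hedged sketch, not the CM code: all demo_* names, the struct layout and the event values are hypothetical, and where the real ib_destroy_cm_id would wait forever for the callback's reference to drop, this model returns -1 so that the situation can be observed instead of hanging.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a CM id; not the kernel's struct ib_cm_id. */
struct demo_cm_id;
typedef int (*demo_cm_handler)(struct demo_cm_id *id, int event);

struct demo_cm_id {
    demo_cm_handler handler;
    int callbacks_active;       /* models the reference held during a callback */
    int destroyed;              /* set once the id has been torn down */
    int last_destroy_result;    /* scratch field used by the "bad" handler */
};

enum { DEMO_EVENT_REQ = 1, DEMO_EVENT_IDLE = 2 };

/*
 * Model of the destroy call. The real call would block until all
 * references are dropped; here that case is reported as -1 instead.
 */
int demo_destroy_cm_id(struct demo_cm_id *id)
{
    if (id->callbacks_active > 0)
        return -1;              /* would block forever in the real CM */
    id->destroyed = 1;
    return 0;
}

/* Model of event delivery: hold a reference around the callback. */
int demo_dispatch_event(struct demo_cm_id *id, int event)
{
    int ret;

    assert(!id->destroyed);
    id->callbacks_active++;
    ret = id->handler(id, event);
    id->callbacks_active--;
    if (ret != 0)
        demo_destroy_cm_id(id); /* non-zero return: dispatcher destroys the id */
    return ret;
}

/* Wrong: tries to destroy the id from inside its own callback. */
int demo_bad_handler(struct demo_cm_id *id, int event)
{
    (void)event;
    id->last_destroy_result = demo_destroy_cm_id(id);
    return 0;
}

/* Right: asks for destruction by returning non-zero. */
int demo_good_handler(struct demo_cm_id *id, int event)
{
    (void)id;
    return event == DEMO_EVENT_IDLE ? 1 : 0;
}
```

With demo_bad_handler the destroy attempt fails while the callback is still running and the id survives; with demo_good_handler the id is destroyed by the dispatcher right after the callback returns, which mirrors the behaviour described above.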
- Sean From krause at cup.hp.com Tue Mar 1 09:11:52 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 01 Mar 2005 09:11:52 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: References: <35EA21F54A45CB47B879F21A91F4862F3FAC5D@taurus.voltaire.com> Message-ID: <6.2.0.14.2.20050301090420.02c34638@esmail.cup.hp.com> At 11:17 PM 2/28/2005, Eric W. Biederman wrote: >"Yaron Haviv" writes: > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Roland Dreier > > > Sent: Monday, February 28, 2005 7:13 PM > > > To: shaharf > > > Cc: openib-general at openib.org > > > Subject: Re: [openib-general] IB Address Translation service > > > > > > This API seems overly complex and at the same time too inflexible to > > > me. However, rather than getting bogged down nitpicking about APIs, I > > > think we have to take a few steps back. > > > > I believe the API is very flexible, but we are pretty open to here what > > you think is needed in addition > > > > > First, let's understand the problem we're trying to solve. Who are > > > the consumers of this address translation service? > > > > The first problem is that most ULPs use valid IP addresses for > > simplicity (DAPL, iSER, NFS/RDMA, SDP, MPI, etc') and someone needs to > > resolve it to an IB address and device to use IB. This should take into > > account cases where there are more than one HCAs in the system. > > Preferable/optionally the ULP would like to know which partition to use > > if there is more than one, and leverage on the IP subnetting done by > > IPoIB. > >I am confused. In any sane network the translation is: >Hostname -> address. > >IP because it spans multiple networks does: >Hostname -> IP address -> hw address. > >IB because it can span multiple IB networks does: >GUID+QPN -> LID + QPN. > >So what is wrong with simply doing: >Hostname -> GUID >??? 
> >Then all the kernel needs to be passed GUID + QPN. > >I am certain MPI does not care about IP addresses. It is the job >of the mpi launcher to resolve where all of the pieces are. Generally >mpirun is done over IP and it just needs to collect the native network >addresses before it leaves. That still does not eliminate the need to resolve some form of address. >It would be brain damaged for DAPL to require IP addresses. Not that >DAPL hasn't shown some brain damage already. I don't believe the IT API requires ATS. It is a bit more flexible and matches better with applications I think. >Please, please remember that IP addresses > > > It is possible to replicate the same code you have in SDP (which is also > > not complete) across all ULP's, I assume a better way is to provide it > > in one central place. > >How about not even worrying about it. It is an extra step that >introduces latency and confusion. > >You can't do GUID -> IP because there is not a requirement on >a 1 to 1 mapping. And in general there is no fixed IP -> GUID mapping. > >What are the semantics in the upper levels when the IP -> GUID mapping >changes? Does you connection properly follow the IP to the new GUID? It should follow a new mapping if done right. >I don't see this making sense anywhere except user space. > > > There are also two proposed address resolution mechanisms, one is ARP > > used by SDP, and one is ATS used by some DAPL consumers, and we believe > > it is better to combine them under the same API. > >Just FYI IPv6 doesn't use arp. ND or ARP for this point is less an issue. > > The second problem relates to mapping of IB GID to one or more Path > > records > > This is also something needed for ALL ULP's. today each ULP provides the > > minimal subset of path resolution functionality without taking into > > account topics such as partitioning, QoS, source routing and > > multi-pathing. 
> > Some of these require using special SA queries (such as SA Multipath > > Record query and QoSPath Query). > > I don't think it make sense to put all this functionality into each ULP > > as well. > >That part is reasonable. Although the fact it is easy to knock >OpenSM down concerns me. However that looks to be a separate >problem. > > > Than we can also discuss, does it make sense to have each path > > resolution call lead us to the sa, or does it make more sense to cache > > those paths. > > And if we cache, doesn't it make more sense to cache/invalidate the > > routes to all ULP's rather implementing/having it in each ULP. > > Also not sure how a 1000 node cluster functions without the caching. > > > > And the last problem is related to reverse resolution from IB to IP > > addresses that is needed for DAPL, as well as for different management > > and diagnostic tools that want to know what is really that node/port > > behind that GID addresses. > > > > So how would you suggest to go about it ? > > Duplicate all of that in each ULP ? > > Refrain from implementing advanced routing, partitioning, QoS (we cant > > really maintain all that advanced code for each ULP) ? > >One small step at a time. Where each step is obviously correct. > >One giant leap only works well for internal use. Not for things >that are heavily used. > > > Our idea is to provide those few helper functions that enable people to > > make full use of IB and its features without reading all the IB spec, > > and a Phd. > > If you clear all the remarks from the library, you will see it is very > > slim, and for my understanding includes all the relevant input and > > output parameters for each of the 3 functions I mentioned. > >But an interface like that is usually provided by glibc not by the kernel. >At the mixing of levels in that proposed API is absolutely horrible. 
> > >Eric >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From tduffy at sun.com Tue Mar 1 09:21:56 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 09:21:56 -0800 Subject: [openib-general] [PATCH][IPOIB] data_debug_level should be declared static Message-ID: <1109697716.11800.2.camel@duffman> Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- drivers/infiniband/ulp/ipoib/ipoib_ib.c (revision 1927) +++ drivers/infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -40,7 +40,7 @@ #include "ipoib.h" #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA -int data_debug_level; +static int data_debug_level; module_param(data_debug_level, int, 0644); MODULE_PARM_DESC(data_debug_level, From tduffy at sun.com Tue Mar 1 09:46:26 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 09:46:26 -0800 Subject: [openib-general] Re: [openib-commits] r1932 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp In-Reply-To: <20050301174315.F23FD22834D@openib.ca.sandia.gov> References: <20050301174315.F23FD22834D@openib.ca.sandia.gov> Message-ID: <1109699186.11800.13.camel@duffman> On Tue, 2005-03-01 at 09:43 -0800, libor at openib.org wrote: > Modified: gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h > =================================================================== > --- gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-02-28 23:43:10 UTC (rev 1931) > +++ gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-03-01 17:43:14 UTC (rev 1932) > @@ -74,3 +74,15 @@ > }; > > #endif /* _SDP_BUFF_P_H */ > + > + > + > + > + > + > + > + > + > + > + > + Checkin turd. 
-tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Tue Mar 1 09:56:09 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 09:56:09 -0800 Subject: [openib-general] [PATCH][SDP] lnx_stream_ops should be declared static Message-ID: <1109699770.11800.17.camel@duffman> lnx_stream_ops should be static. Also, fix one more static name in sdp_proc.c Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_inet.c (revision 1932) +++ drivers/infiniband/ulp/sdp/sdp_inet.c (working copy) @@ -1373,7 +1373,7 @@ static int sdp_inet_shutdown(struct sock /* * Primary socket initialization */ -struct proto_ops _lnx_stream_ops = { +static struct proto_ops lnx_stream_ops = { .family = AF_INET_SDP, .release = sdp_inet_release, .bind = sdp_inet_bind, @@ -1419,7 +1419,7 @@ static int sdp_inet_create(struct socket return -ENOMEM; } - sock->ops = &_lnx_stream_ops; + sock->ops = &lnx_stream_ops; sock->state = SS_UNCONNECTED; sock_graft(conn->sk, sock); Index: drivers/infiniband/ulp/sdp/sdp_proc.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_proc.c (revision 1932) +++ drivers/infiniband/ulp/sdp/sdp_proc.c (working copy) @@ -81,7 +81,7 @@ static int sdp_proc_read_parse(char *pag * (anything that is not a module) should create an entry and define read * write function. 
*/ -static struct sdpc_proc_ent _file_entry_list[SDP_PROC_ENTRIES] = { +static struct sdpc_proc_ent file_entry_list[SDP_PROC_ENTRIES] = { { .entry = NULL, .type = SDP_PROC_ENTRY_MAIN_BUFF, @@ -136,7 +136,7 @@ int sdp_main_proc_cleanup(void) * first clean-up the frameworks tables */ for (counter = 0; counter < SDP_PROC_ENTRIES; counter++) { - sub_entry = &_file_entry_list[counter]; + sub_entry = &file_entry_list[counter]; if (sub_entry->entry) { remove_proc_entry(sub_entry->name, dir_root); sub_entry->entry = NULL; @@ -189,7 +189,7 @@ int sdp_main_proc_init(void) dir_root->owner = THIS_MODULE; for (counter = 0; counter < SDP_PROC_ENTRIES; counter++) { - sub_entry = &_file_entry_list[counter]; + sub_entry = &file_entry_list[counter]; if (sub_entry->type != counter) { result = -EFAULT; goto error; From libor at topspin.com Tue Mar 1 09:58:21 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 1 Mar 2005 09:58:21 -0800 Subject: [openib-general] Re: [openib-commits] r1932 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp In-Reply-To: <1109699186.11800.13.camel@duffman>; from tduffy@sun.com on Tue, Mar 01, 2005 at 09:46:26AM -0800 References: <20050301174315.F23FD22834D@openib.ca.sandia.gov> <1109699186.11800.13.camel@duffman> Message-ID: <20050301095821.A27810@topspin.com> On Tue, Mar 01, 2005 at 09:46:26AM -0800, Tom Duffy wrote: > On Tue, 2005-03-01 at 09:43 -0800, libor at openib.org wrote: > > Modified: gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h > > =================================================================== > > --- gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-02-28 23:43:10 UTC (rev 1931) > > +++ gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-03-01 17:43:14 UTC (rev 1932) > > @@ -74,3 +74,15 @@ > > }; > > Checkin turd. That's odd. It doesn't make much sense to have a seperate header file for a single structure which is used in the same place as sdp_buff.h and four constants. 
Here's a patch to get rid of the file. 4 files changed, 28 insertions(+), 91 deletions(-) -Libor Signed-off-by: Libor Michalek Index: sdp_main.h =================================================================== --- sdp_main.h (revision 1932) +++ sdp_main.h (working copy) @@ -115,6 +115,4 @@ #include "sdp_advt.h" #include "sdp_iocb.h" -#include "sdp_buff_p.h" - #endif /* _SDP_MAIN_H */ Index: sdp_dev.h =================================================================== --- sdp_dev.h (revision 1932) +++ sdp_dev.h (working copy) @@ -111,8 +111,14 @@ #define SDP_SEND_POST_FRACTION 0x06 #define SDP_SEND_POST_SLOW 0x01 #define SDP_SEND_POST_COUNT 0x0A - /* + * Buffer pool initialization defaul values. + */ +#define SDP_BUFF_POOL_COUNT_MIN 1024 +#define SDP_BUFF_POOL_COUNT_MAX 1048576 +#define SDP_BUFF_POOL_COUNT_INC 128 +#define SDP_BUFF_POOL_FREE_MARK 1024 +/* * SDP experimental parameters. */ Index: sdp_buff.h =================================================================== --- sdp_buff.h (revision 1932) +++ sdp_buff.h (working copy) @@ -76,6 +76,27 @@ u32 lkey; /* component of scather/gather list (key) */ }; +struct sdpc_buff_root { + /* + * variant + */ + struct sdpc_buff_q pool; /* actual pool of buffers */ + spinlock_t lock; /* spin lock for pool access */ + /* + * invariant + */ + kmem_cache_t *pool_cache; /* cache of pool objects */ + kmem_cache_t *buff_cache; /* cache of buffer descriptor objects */ + + int buff_min; /* minimum allocated buffers */ + int buff_max; /* maximum allocated buffers */ + int buff_cur; /* total allocated buffers */ + int buff_size; /* size of each buffer in the pool */ + + int alloc_inc; /* allocation increment */ + int free_mark; /* start freeing unused buffers */ +}; + /* * buffer flag defintions */ Index: sdp_buff_p.h =================================================================== --- sdp_buff_p.h (revision 1932) +++ sdp_buff_p.h (working copy) @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2005 Topspin Communications. 
All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ - -#ifndef _SDP_BUFF_P_H -#define _SDP_BUFF_P_H -/* - * linux types - */ -#include -#include -#include - -#include "sdp_buff.h" -/* - * definitions - */ -#define SDP_BUFF_POOL_COUNT_MIN 1024 -#define SDP_BUFF_POOL_COUNT_MAX 1048576 -#define SDP_BUFF_POOL_COUNT_INC 128 -#define SDP_BUFF_POOL_FREE_MARK 1024 -/* - * structures - */ -struct sdpc_buff_root { - /* - * variant - */ - struct sdpc_buff_q pool; /* actual pool of buffers */ - spinlock_t lock; /* spin lock for pool access */ - /* - * invariant - */ - kmem_cache_t *pool_cache; /* cache of pool objects */ - kmem_cache_t *buff_cache; /* cache of buffer descriptor objects */ - - int buff_min; /* minimum allocated buffers */ - int buff_max; /* maximum allocated buffers */ - int buff_cur; /* total allocated buffers */ - int buff_size; /* size of each buffer in the pool */ - - int alloc_inc; /* allocation increment */ - int free_mark; /* start freeing unused buffers */ -}; - -#endif /* _SDP_BUFF_P_H */ - - - - - - - - - - - - From tduffy at sun.com Tue Mar 1 10:10:18 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 10:10:18 -0800 Subject: [openib-general] Re: [openib-commits] r1932 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp In-Reply-To: <20050301095821.A27810@topspin.com> References: <20050301174315.F23FD22834D@openib.ca.sandia.gov> <1109699186.11800.13.camel@duffman> <20050301095821.A27810@topspin.com> Message-ID: <1109700619.11800.18.camel@duffman> On Tue, 2005-03-01 at 09:58 -0800, Libor Michalek wrote: > That's odd. It doesn't make much sense to have a seperate header file > for a single structure which is used in the same place as sdp_buff.h and > four constants. Here's a patch to get rid of the file. That's one way to get rid of the turd ;) Looks good. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Tue Mar 1 10:12:35 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 10:12:35 -0800 Subject: [openib-general] [PATCH][CORE] fix sparse warnings about static variables Message-ID: <1109700755.11800.21.camel@duffman> This gets rid of the new sparse warnings like: /build1/tduffy/openib-work/linux-2.6.10-openib/drivers/infiniband/core/mad.c:50:14: warning: symbol 'ib_mad_cache' was not declared. Should it be static? Signed-off-by: Tom Duffy Index: drivers/infiniband/core/agent.c =================================================================== --- drivers/infiniband/core/agent.c (revision 1922) +++ drivers/infiniband/core/agent.c (working copy) @@ -45,14 +45,11 @@ #include "smi.h" #include "agent_priv.h" #include "mad_priv.h" - +#include "agent.h" spinlock_t ib_agent_port_list_lock; static LIST_HEAD(ib_agent_port_list); -extern kmem_cache_t *ib_mad_cache; - - /* * Caller must hold ib_agent_port_list_lock */ Index: drivers/infiniband/core/cache.c =================================================================== --- drivers/infiniband/core/cache.c (revision 1922) +++ drivers/infiniband/core/cache.c (working copy) @@ -38,6 +38,7 @@ #include #include "core_priv.h" +#include "ib_cache.h" struct ib_pkey_cache { int table_len; Index: drivers/infiniband/core/mad_priv.h =================================================================== --- drivers/infiniband/core/mad_priv.h (revision 1922) +++ drivers/infiniband/core/mad_priv.h (working copy) @@ -194,4 +194,6 @@ struct ib_mad_port_private { struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; +extern kmem_cache_t *ib_mad_cache; + #endif /* __IB_MAD_PRIV_H__ */ Index: drivers/infiniband/core/smi.c =================================================================== --- drivers/infiniband/core/smi.c (revision 1922) +++ drivers/infiniband/core/smi.c (working copy) @@ -37,7 
+37,7 @@
  */
 #include
-
+#include "smi.h"
 /*
  * Fixup a directed route SMP for sending

From mshefty at ichips.intel.com Tue Mar 1 10:27:37 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 01 Mar 2005 10:27:37 -0800
Subject: [openib-general] [MAD] RMPP reassembly
Message-ID: <4224B419.1080601@ichips.intel.com>

I'm studying the RMPP implementation requirements for reassembly, and
there are a couple of issues/questions.

* What is an appropriate window size for the receiver to use? My
  initial thought was to use 1/8th of the receive queue size, but this
  would be easy to change.

* For the total transaction timeout, the equation given to calculate
  the value would probably require 1000+ lines of code, and the default
  value given is 40 seconds, which seems long. Any opinions on what
  approach to take here? I can either go with a total reassembly
  timeout value, or a timeout relative to the last received segment.
  I'm leaning towards whichever ends up being easier to implement.

* Have people found it necessary to keep the context of a reassembled
  MAD around after reassembly has completed?

- Sean

From mshefty at ichips.intel.com Tue Mar 1 10:34:16 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 01 Mar 2005 10:34:16 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <4224B419.1080601@ichips.intel.com>
References: <4224B419.1080601@ichips.intel.com>
Message-ID: <4224B5A8.4000109@ichips.intel.com>

Sean Hefty wrote:
> I'm studying the RMPP implementation requirements for reassembly, and
> there are a couple of issues/questions.

Also, does anyone know of any existing RMPP implementations outside of
the SourceForge IB stack?
- Sean From halr at voltaire.com Tue Mar 1 10:41:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Mar 2005 13:41:50 -0500 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B5A8.4000109@ichips.intel.com> References: <4224B419.1080601@ichips.intel.com> <4224B5A8.4000109@ichips.intel.com> Message-ID: <1109702314.23094.1763.camel@localhost.localdomain> Hi Sean, On Tue, 2005-03-01 at 13:34, Sean Hefty wrote: > Also, does anyone know of any existing RMPP implementations outside of > the SourceForge IB stack? Voltaire has one in its gen1 stack. -- Hal From mshefty at ichips.intel.com Tue Mar 1 10:52:05 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 10:52:05 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <1109702314.23094.1763.camel@localhost.localdomain> References: <4224B419.1080601@ichips.intel.com> <4224B5A8.4000109@ichips.intel.com> <1109702314.23094.1763.camel@localhost.localdomain> Message-ID: <4224B9D5.2010205@ichips.intel.com> Hal Rosenstock wrote: > Hi Sean, > > On Tue, 2005-03-01 at 13:34, Sean Hefty wrote: > >>Also, does anyone know of any existing RMPP implementations outside of >>the SourceForge IB stack? > > > Voltaire has one in its gen1 stack. (Resending to list) Can you send me a link to the directory? - Sean From roland at topspin.com Tue Mar 1 11:48:46 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 11:48:46 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B419.1080601@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Mar 2005 10:27:37 -0800") References: <4224B419.1080601@ichips.intel.com> Message-ID: <527jkrp2ox.fsf@topspin.com> Sean> * For the total transaction timeout, the equation given to Sean> calculate the value would probably require 1000+ lines of Sean> code, and the default value given is 40 seconds, which seems Sean> long. Any opinions on what approach to take here? 
I can Sean> either go with a total reassembly timeout value, or a Sean> timeout relative to the last received segment. I'm leaning Sean> towards whichever ends up being easier to implement. I'd be somewhat scared to tinker with the timeout calculations without doing some heavy-duty research into how the modified version interacts with a spec-compliant implementation. Experience with TCP shows that protocol behavior in the face of packet loss can be complex and unpredictable and that minor changes in the protocol can lead to large degradations in performance. - R. From mshefty at ichips.intel.com Tue Mar 1 11:57:27 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 11:57:27 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <527jkrp2ox.fsf@topspin.com> References: <4224B419.1080601@ichips.intel.com> <527jkrp2ox.fsf@topspin.com> Message-ID: <4224C927.80206@ichips.intel.com> Roland Dreier wrote: > Sean> * For the total transaction timeout, the equation given to > Sean> calculate the value would probably require 1000+ lines of > Sean> code, and the default value given is 40 seconds, which seems > Sean> long. Any opinions on what approach to take here? I can > Sean> either go with a total reassembly timeout value, or a > Sean> timeout relative to the last received segment. I'm leaning > Sean> towards whichever ends up being easier to implement. > > I'd be somewhat scared to tinker with the timeout calculations without > doing some heavy-duty research into how the modified version interacts > with a spec-compliant implementation. Experience with TCP shows that > protocol behavior in the face of packet loss can be complex and > unpredictable and that minor changes in the protocol can lead to large > degradations in performance. 

I would tend to agree, except that the IB spec gives this beauty of an
equation for calculating total transaction timeout:

4.096 us x 8 x ceiling(payload length/220) x
  (2 ^ packet lifetime from sender to receiver +
   2 ^ packet lifetime from receiver to sender +
   2 ^ receiver response time value (ClassPortInfo:RespTimeValue or 20) +
   2 ^ sender response time value (ClassPortInfo:RespTimeValue or 20))

Getting from receiving the first segment of an RMPP MAD to this value is
non-trivial, and doing so before the sender times out is even more
difficult. Is there spec compliant implementation of this in existence?
If so, I'd be interested in seeing it.

- Sean

From roland at topspin.com Tue Mar 1 12:00:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 01 Mar 2005 12:00:50 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <4224C927.80206@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Mar 2005 11:57:27 -0800")
References: <4224B419.1080601@ichips.intel.com> <527jkrp2ox.fsf@topspin.com> <4224C927.80206@ichips.intel.com>
Message-ID: <523bvfp24t.fsf@topspin.com>

    Sean> Getting from receiving the first segment of an RMPP MAD to
    Sean> this value is non-trivial, and doing so before the sender
    Sean> times out is even more difficult. Is there spec compliant
    Sean> implementation of this in existence? If so, I'd be
    Sean> interested in seeing it.

Yeah, I know that equation. It doesn't seem that bad to calculate --
I guess the worst part is dividing by 220, but that shouldn't be more
than a few hundred cycles.

 - R.
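
For reference, the arithmetic of the timeout equation quoted in this
thread can be sketched in C roughly as below. This is only a sketch under
stated assumptions, not code from any existing stack: the function name
rmpp_total_timeout_ns, the macro names, and the choice to clamp each
exponent to 31 are all hypothetical, and the formula is taken as quoted
in the message above rather than checked against the spec text. It covers
only the calculation; it does not address the harder part raised in the
thread, which is obtaining the packet lifetimes and response time values
in the first place.

```c
#include <stdint.h>

/* Names below are illustrative assumptions, not taken from any IB stack. */
#define RMPP_BYTES_PER_SEGMENT 220u  /* divisor in the quoted equation */
#define RMPP_TIME_UNIT_NS      4096u /* 4.096 us expressed in nanoseconds */
#define RMPP_MAX_EXPONENT      31u   /* assumed clamp for each exponent */

/* Return 2^exp, clamping the exponent so the shift stays well defined. */
static uint64_t rmpp_pow2(unsigned int exp)
{
    if (exp > RMPP_MAX_EXPONENT)
        exp = RMPP_MAX_EXPONENT;
    return (uint64_t)1 << exp;
}

/*
 * Total transaction timeout in nanoseconds, following the equation as
 * quoted in the thread:
 *   4.096 us x 8 x ceiling(payload_len / 220) x
 *     (2^plt_s2r + 2^plt_r2s + 2^resp_recv + 2^resp_send)
 * Saturates at UINT64_MAX instead of overflowing.
 */
uint64_t rmpp_total_timeout_ns(uint32_t payload_len,
                               unsigned int plt_s2r,
                               unsigned int plt_r2s,
                               unsigned int resp_recv,
                               unsigned int resp_send)
{
    /* Ceiling division: number of 220-byte segments in the payload. */
    uint64_t segments = ((uint64_t)payload_len + RMPP_BYTES_PER_SEGMENT - 1)
                        / RMPP_BYTES_PER_SEGMENT;
    uint64_t sum = rmpp_pow2(plt_s2r) + rmpp_pow2(plt_r2s)
                 + rmpp_pow2(resp_recv) + rmpp_pow2(resp_send);
    /* At most 2^15 * 2^33 = 2^48, so this product cannot overflow. */
    uint64_t per_segment = (uint64_t)RMPP_TIME_UNIT_NS * 8u * sum;

    if (segments != 0 && per_segment > UINT64_MAX / segments)
        return UINT64_MAX;
    return per_segment * segments;
}
```

As a sanity check on the arithmetic, a single-segment payload with all
four exponents set to 0 gives 4096 * 8 * 4 = 131072 ns, and one extra
byte beyond 220 doubles that because of the ceiling.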
From mshefty at ichips.intel.com Tue Mar 1 12:05:06 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 12:05:06 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <523bvfp24t.fsf@topspin.com> References: <4224B419.1080601@ichips.intel.com> <527jkrp2ox.fsf@topspin.com> <4224C927.80206@ichips.intel.com> <523bvfp24t.fsf@topspin.com> Message-ID: <4224CAF2.1070002@ichips.intel.com> Roland Dreier wrote: > Sean> Getting from receiving the first segment of an RMPP MAD to > Sean> this value is non-trivial, and doing so before the sender > Sean> times out is even more difficult. Is there spec compliant > Sean> implementation of this in existence? If so, I'd be > Sean> interested in seeing it. > > Yeah, I know that equation. It doesn't seem that bad to calculate -- > I guess the worst part is dividing by 220, but that shouldn't be more > than a few hundred cycles. I'm more concerned about getting the necessary data than performing the actual calculation. - Sean From jlentini at netapp.com Tue Mar 1 12:58:15 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 1 Mar 2005 15:58:15 -0500 (EST) Subject: [openib-general] MTHCA features In-Reply-To: <521xazqrp4.fsf@topspin.com> References: <52bra4s8w1.fsf@topspin.com> <521xazqrp4.fsf@topspin.com> Message-ID: NFS/RDMA doesn't require memory windows but it will make use of them if they are available. -james On Tue, 1 Mar 2005, Roland Dreier wrote: > James> I ask because the DAT API contains interfaces that allow > James> users to interact with memory windows. > > Are there any real applications that use those interfaces? > > Thanks, > Roland > From hch at lst.de Tue Mar 1 14:05:43 2005 From: hch at lst.de (Christoph Hellwig) Date: Tue, 1 Mar 2005 23:05:43 +0100 Subject: [openib-general] putting in dead wood for DAPL and similar abomination Message-ID: <20050301220543.GA16443@lst.de> Please don't put in things like the address translation service or memory windows for DAPL folks. 
The IB code in the kernel already has far too much unused stuff and adding more will not go past reviews for kernel inclusions - as will DAPL itself exactly because of such utter stupidities. Similar hint to the NFS over RDMA folks at CITI - if you want your stuff to go in use the openib helper directly below the transport switch - differnet RDMA transports are too diverse to be sanely abstracted out and DAPL does a horrible job at that. If we need to consolidate code for differnt transports we can put it into a library later on. From tduffy at sun.com Tue Mar 1 14:13:27 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 14:13:27 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> Message-ID: <1109715208.11800.41.camel@duffman> On Tue, 2005-03-01 at 08:07 +0200, Yaron Haviv wrote: > The one thing that ATS provide and is not possible with ARP is reverse > resolution GID->IP, any ideas how to achieve that without ATS ? RARP. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Tue Mar 1 14:53:18 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 00:53:18 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <52u0o4pfe8.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> Message-ID: <20050301225318.GA16946@mellanox.co.il> Quoting r. Roland Dreier : > Subject: ANNOUNCE: First usable version of userspace verbs > > I'm happy to announce the initial availability of userspace verbs > support for brave testers. 
> > To try this out, check out the roland-uverbs subversion branch: > > svn co https://openib.org/svn/gen2/branches/roland-uverbs > > and build as usual. Select CONFIG_INFINIBAND_USER_VERBS to build > userspace verbs support. > > If you want to use a linux-2.6.10 kernel, you will need to apply the > new linux-2.6.10-backports.diff patch from the branch (which just > exports get_sb_pseudo()). No patches at all are required for an > up-to-date BK or linux-2.6.11-rc4 tree. > > If you use udev, add the rule > > KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666" > > to your configuration. Otherwise, create the required device files: > > mknod /dev/infiniband/uverbs0 c 231 128 > mknod /dev/infiniband/uverbs1 c 231 129 > > and so on for as many HCAs as you have installed. > > The build the userspace libraries in src/userspace/libibverbs and > src/userspace/libmthca with the usual > > ./autogen.sh && ./configure && make && sudo make install > > passing whatever parameters to configure you want; you can use > --prefix to install to another location. If you set a non-standard > prefix, it may be useful to pass a -I in CPPFLAGS to the > configure for libmthca. > > Once you have the libraries built and installed, load the ib_mthca and > ib_uverbs modules. By default, libibverbs will search for driver > libraries in /lib/infiniband; if you installed libmthca > somewhere else, set the OPENIB_DRIVER_PATH environment variable to > point to the directory with mthca.so. > > To actually try things out, you can use the ibv_pingpong program > shipped as part of the libibverbs package. For example, one one > system start the server side > > $ ibv_pingpong > > and on another system start the client by passing the address of the > server (in this example I use IPv6 over IPoIB): > > $ ibv_pingpong fe80::202:c901:7fc:c711%ib0 > > The pingpong program has a number of options -- run ibv_pingpong -h to > see a list of the switches you can try. 
> > The current code is stable for me, but all that means is that my tiny > selection of tests and test systems has not uncovered any of the bugs > that are undoubtedly present. Some of the limitations I know about: > > - Only RC is implemented. There are not even any functions to call > to create UD address handles yet. > - Only Tavor mode is supported -- PCI Express HCAs will not work if > they are running mem-free firmware. > - On x86, only CPUs with SSE will work now. I'd be surprised if > anyone has x86 system with an HCA that doesn't have SSE. > > Also, I've only tried 32-bit i386 userspace running on i386 and x86_64 > kernels -- I don't expect any portability problems but I haven't even > built for other architectures. > > In any case, please give this a spin and let me know how it looks to you. > > My short- and medium-term plans are: > > 1. Catch up on reviewing and applying the patche queue I'm sitting on. > 2. Land the Arbel mem-free mode support from the roland-uverbs branch > onto the main trunk (and merge it upstream once 2.6.11 is out and > 2.6.12 opens). > 3. Implement UD support for userspace. I should have this done before > the end of next week. > 4. Implement mem-free support for userspace. > > Thanks, > Roland Roland, I have implemented a small test for the rdma functionality. I based it on the pingpong test, the main change being polling on data instead of completions (but I also changed the clock sampling to use the realtime clock from -lrt, since it gives a more consistent timing results on my system). This is useful as an example of using rdma, and is also useful as a post send latency benchmark, for tuning (nicer than the send test in that it let us measure post send separately from poll cq). Do you want such stuff under libibverbs/examples, or somewhere else? -- MST - Michael S. 
Tsirkin From tduffy at sun.com Tue Mar 1 15:01:46 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 15:01:46 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD2D@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAD2D@taurus.voltaire.com> Message-ID: <1109718106.11800.51.camel@duffman> [ putting back on list ] On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote: > Did you try RARP with IPoIB ? I have not. > I thought that there is some issue that it doesn't work Currently, the rarpd only works with ethernet, but I don't see why this couldn't be fixed. > Also I hope you can comment on the other ib_at capabilities which are > more important than ATS I don't mind the idea of abstracting out address translation. I think maybe this is a premature optimization and we should see how each ULP uses/does it first, then abstract out common code. Otherwise, I feel neither strongly for or against your proposal. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Mar 1 15:02:14 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 15:02:14 -0800 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050301225318.GA16946@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 00:53:18 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> Message-ID: <52y8d7nf61.fsf@topspin.com> Michael> I have implemented a small test for the rdma Michael> functionality. 
I based it on the pingpong test, the main Michael> change being polling on data instead of completions (but Michael> I also changed the clock sampling to use the realtime Michael> clock from -lrt, since it gives a more consistent timing Michael> results on my system). Sounds great, thanks. Michael> Do you want such stuff under libibverbs/examples, or Michael> somewhere else? Please generate a patch putting it under libibverbs/examples. If it makes sense to share code from pingpong.c, feel free to split pingpong.c into multiple source files and share the code. Thanks, Roland From roland at topspin.com Tue Mar 1 15:06:48 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 15:06:48 -0800 Subject: [openib-general] putting in dead wood for DAPL and similar abomination In-Reply-To: <20050301220543.GA16443@lst.de> (Christoph Hellwig's message of "Tue, 1 Mar 2005 23:05:43 +0100") References: <20050301220543.GA16443@lst.de> Message-ID: <52u0nvneyf.fsf@topspin.com> Christoph> Please don't put in things like the address translation Christoph> service or memory windows for DAPL folks. The IB code Christoph> in the kernel already has far too much unused stuff and Christoph> adding more will not go past reviews for kernel Christoph> inclusions - as will DAPL itself exactly because of Christoph> such utter stupidities. Similar hint to the NFS over Christoph> RDMA folks at CITI - if you want your stuff to go in Christoph> use the openib helper directly below the transport Christoph> switch - differnet RDMA transports are too diverse to Christoph> be sanely abstracted out and DAPL does a horrible job Christoph> at that. If we need to consolidate code for differnt Christoph> transports we can put it into a library later on. I agree with this sentiment. (Notice how I asked if any real applications are using memory windows?) 

I also agree that it makes sense to build abstractions by looking at
multiple real implementations, rather than trying to design the
abstractions in advance. We're just now beginning to understand how a
clean InfiniBand stack should look, and I haven't seen any free
software for other RDMA transports.

By the way, at least for the code I wrote, anything that doesn't have a
kernel user yet is there because it is used by a real protocol that
should make it upstream eventually.

 - R.

From roland at topspin.com Tue Mar 1 15:13:29 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 01 Mar 2005 15:13:29 -0800
Subject: [openib-general] [PATCH][IPOIB] data_debug_level should be declared static
In-Reply-To: <1109697716.11800.2.camel@duffman> (Tom Duffy's message of "Tue, 01 Mar 2005 09:21:56 -0800")
References: <1109697716.11800.2.camel@duffman>
Message-ID: <52ll97nena.fsf@topspin.com>

Thanks, applied.

 - R.

From roland at topspin.com Tue Mar 1 15:22:30 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 01 Mar 2005 15:22:30 -0800
Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs
In-Reply-To: <20050301225318.GA16946@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 00:53:18 +0200")
References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il>
Message-ID: <52hdjvne89.fsf@topspin.com>

    mst> but I also changed the clock sampling to use the realtime
    mst> clock from -lrt, since it gives a more consistent timing
    mst> results on my system.

By the way, what exactly are you using? clock_gettime() with
CLOCK_REALTIME? Do you know what the difference from gettimeofday is?

I haven't followed Linux timekeeping development too closely but there
should be some portable libc way to get high-resolution time without a
system call (ie rdtsc on x86, mftb on ppc, etc).

 - R.
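
The two clock calls being compared here can be put side by side in a
small C sketch. This is not the code from the rdma test mentioned in the
thread, which is not shown; the helper names now_clock_gettime_ns,
now_gettimeofday_ns, avg_ns_per_iter and sample_op are hypothetical and
exist only for illustration. The sketch samples the clock once around a
whole batch of iterations, so the cost of reading the clock is spread
over the batch. On the glibc versions of that era clock_gettime required
linking with -lrt, as the thread notes; newer glibc provides it in libc
itself.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600  /* expose clock_gettime() and gettimeofday() */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* CLOCK_REALTIME reading in nanoseconds, or 0 if the call fails. */
uint64_t now_clock_gettime_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* gettimeofday() reading in nanoseconds (microsecond resolution). */
uint64_t now_gettimeofday_ns(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return 0;
    return (uint64_t)tv.tv_sec * 1000000000ull
         + (uint64_t)tv.tv_usec * 1000ull;
}

/* Placeholder for the operation under test (hypothetical). */
void sample_op(void)
{
}

/*
 * Average nanoseconds per call of op(), reading the clock only twice
 * for the whole batch. Returns 0.0 for an empty batch or if the
 * realtime clock stepped backwards during the run.
 */
double avg_ns_per_iter(void (*op)(void), unsigned int iters)
{
    uint64_t start, end;
    unsigned int i;

    if (iters == 0)
        return 0.0;
    start = now_clock_gettime_ns();
    for (i = 0; i < iters; i++)
        op();
    end = now_clock_gettime_ns();
    if (end < start)
        return 0.0;
    return (double)(end - start) / (double)iters;
}
```

A call such as avg_ns_per_iter(sample_op, 1000) reads the clock twice per
1000 iterations. CLOCK_REALTIME can be adjusted while a run is in
progress, which is why the sketch guards against a negative interval.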
From yaronh at voltaire.com Tue Mar 1 15:38:50 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 2 Mar 2005 01:38:50 +0200 Subject: [openib-general] putting in dead wood for DAPL and similarabomination Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAD30@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Christoph Hellwig > Sent: Wednesday, March 02, 2005 12:06 AM > To: openib-general at openib.org > Subject: [openib-general] putting in dead wood for DAPL and > similarabomination > > Please don't put in things like the address translation service or > memory windows for DAPL folks. The IB code in the kernel already > has far too much unused stuff and adding more will not go past reviews > for kernel inclusions - as will DAPL itself exactly because of such > utter stupidities. Even if your approach to DAPL was right you still have address translation service in SDP, and would need one for NFS/RDMA, and another one to iSER and another one for Lustre, etc' (even if they are coded directly to the verbs) Not to mention other protocols that access the SA (e.g. SRP, ..). So is your idea to duplicate that functionality for all the ULPs ? Would that make the code simpler and easier to maintain ? 
Yaron From hch at lst.de Tue Mar 1 15:46:44 2005 From: hch at lst.de (Christoph Hellwig) Date: Wed, 2 Mar 2005 00:46:44 +0100 Subject: [openib-general] putting in dead wood for DAPL and similarabomination In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD30@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAD30@taurus.voltaire.com> Message-ID: <20050301234644.GA18115@lst.de> On Wed, Mar 02, 2005 at 01:38:50AM +0200, Yaron Haviv wrote: > Even if your approach to DAPL was right you still have address > translation service in SDP, and would need one for NFS/RDMA, and another > one to iSER and another one for Lustre, etc' (even if they are coded > directly to the verbs) Not to mention other protocols that access the SA > (e.g. SRP, ..). > > So is your idea to duplicate that functionality for all the ULPs ? > Would that make the code simpler and easier to maintain ? Get the code out first and then see what can be shared and what not, there's no way to find a sane API otherwise. From libor at topspin.com Tue Mar 1 15:52:43 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 1 Mar 2005 15:52:43 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAC6A@taurus.voltaire.com>; from yaronh@voltaire.com on Mon, Feb 28, 2005 at 11:55:54PM +0200 References: <35EA21F54A45CB47B879F21A91F4862F3FAC6A@taurus.voltaire.com> Message-ID: <20050301155243.B27810@topspin.com> On Mon, Feb 28, 2005 at 11:55:54PM +0200, Yaron Haviv wrote: > From: Libor Michalek > > > > SDP does implement a subset of the proposed functionality for > > resolving IP addresses to PathRecords which can then be used in > > a CM REQ request, plus some basic caching. All the code is isolated > > to a single file, sdp_link.c. 
There's really only a single entry > > point API, plus a completion function: > > > > int sdp_link_path_lookup(u32 dst_addr, > > u32 src_addr, > > int bound_dev_if, > > void (*completion)(u64 id, > > int status, > > u32 dst_addr, > > u32 src_addr, > > u8 hw_port, > > struct ib_device *ca, > > struct ib_sa_path_rec *path, > > void *arg), > > void *arg, > > u64 *id); > > > > The values are based on strictly what is needed by either the Linux > > routing code to resolve the address, or the IB APIs to establish the > > connection. The implementation has three stages: > > > > - src/dst IP address -> IPoIB net_device, IB ca, IB port, IB pkey. > > - dst IP address and IPoIB net_device -> dst GID using IPoIB ARP > > - dst GID -> PathRecord using ib_sa. > > Libor the idea is that ib_at provides similar functionality > Sahar looked through your SDP code prior to proposing the API > We would like to have a common API for all the ULP's that provide that > functionality, and specifically now when we implement kDAPL over OpenIB. Sure, it does make sense to break this code into it's own module if there are multiple ULPs that need to use the code, and sounds like we are getting close to having another ULP which needs this resolution. However, the API feels like it is intended to provide every possible bell and whistle imaginable. It is far better to start with a simple clean minimum of features and add to the functionality as new ULPs are introduced or old ULPs are improved. I may be wrong, and people have intentions for each function and parameter that you proposed, but it feels so large that it would be good to hear which ULPs you envision using each of the functions, especially some of the less obvious ones. Remember, this API does not need to be frozen out of the gate, changes can and will be made, incompatabilities will be introduced. 
I would like to see the feature set, if possible, split between user and kernel space, we should minimize what's in the kernel, and features that are only needed in userspace, should only be implemented in userspace. I also see kDAPL as weak justification for a feature. (notice I did no say uDAPL) I would be better to see a kDAPL proposal, by which I mean code, that had a chance before we start including features for it in surrounding code. As it stands it has an uphill battle, and not just because of the API itself. > To summaries the differences: > > The reasons we broken it to two functions (IP->GID, GID->Path) and not > have an IP->Path API (like we also used to have in our gen1 stack) are: > > a. some consumers will only need the 1st part (e.g. just to know which > HCA to use) > b. some may use only the 2nd part (e.g. IPoIB, SRP) > c. you can get parameters from the first part (e.g. P_Key, and decide to > overwrite it with your own P_Key, etc') > d. the 2nd function provides more options for multipath, partitioning, > QoS > e. we can now more easily use different IP resolution mechanisms without > changing the 2nd function (ARP or ATS). I have no real problem with spliting the two halves of the resolution into two functions, as long as the common case of IP->Path is easy to perform. By which I mean that all the parameters I need for GID->Path are either in the IP->GID result or are obvious. Which it sounds like from your later comment. > We added source IP and TOS as optional parameters for the IP->GID, just > because IP route can be defined for Src/dst/TOS, and it's already part > of Linux. OK, sounds good. I'm using source IP now, since it's possible to bind a socket to a specific source address before connecting. -Libor From mst at mellanox.co.il Tue Mar 1 16:22:19 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Wed, 2 Mar 2005 02:22:19 +0200
Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs
In-Reply-To: <52hdjvne89.fsf@topspin.com>
References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> <52hdjvne89.fsf@topspin.com>
Message-ID: <20050302002219.GB16946@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: ANNOUNCE: First usable version of userspace verbs
>
> mst> but I also changed the clock sampling to use the realtime
> mst> clock from -lrt, since it gives a more consistent timing
> mst> results on my system.
>
> By the way, what exactly are you using? clock_gettime() with
> CLOCK_REALTIME?

Yes.

> Do you know what the difference from gettimeofday is?

I didn't investigate; all I know is that with clock_gettime I seem to
get consistent results across runs, not so with gettimeofday.

> I haven't followed Linux timekeeping development too closely but there
> should be some portable libc way to get high-resolution time without a
> system call (ie rdtsc on x86, mftb on ppc, etc).
>
> - R.

I can look up the librt source, I guess.

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Tue Mar 1 16:38:36 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 Mar 2005 02:38:36 +0200
Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs
In-Reply-To: <52hdjvne89.fsf@topspin.com>
References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> <52hdjvne89.fsf@topspin.com>
Message-ID: <20050302003836.GA17646@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: ANNOUNCE: First usable version of userspace verbs
>
> mst> but I also changed the clock sampling to use the realtime
> mst> clock from -lrt, since it gives a more consistent timing
> mst> results on my system.
>
> By the way, what exactly are you using? clock_gettime() with
> CLOCK_REALTIME? Do you know what the difference from gettimeofday is?
>
> I haven't followed Linux timekeeping development too closely but there
> should be some portable libc way to get high-resolution time without a
> system call (ie rdtsc on x86, mftb on ppc, etc).
>
> - R.
>

Looking at the libc sources (glibc-2.3.2-200304020432), there appears to
be an internal macro for it, but I don't see it exported; it seems to be
used for the ./malloc/memusage.c implementation. I'll look for a library
outside of libc that we can use.

clock_gettime is a syscall, so it has overhead, of course. Still, since
we call it once per 1000 iterations, the overhead isn't big.

--
MST - Michael S. Tsirkin

From tziporet at mellanox.co.il Wed Mar 2 00:33:56 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 2 Mar 2005 10:33:56 +0200
Subject: [openib-general] MTHCA features
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF063@mtlex01.yok.mtl.com>

We recommend working with FMRs rather than memory windows, for
performance reasons. FMRs are much faster, and are available for kernel
modules only. They are not yet implemented in mthca but it is possible
to add them.

Tziporet

-----Original Message-----
From: James Lentini [mailto:jlentini at netapp.com]
Sent: Tuesday, March 01, 2005 10:58 PM
To: Roland Dreier
Cc: openib-general at openib.org
Subject: Re: [openib-general] MTHCA features

NFS/RDMA doesn't require memory windows but it will make use of them if
they are available.

-james
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at mellanox.co.il Wed Mar 2 01:27:51 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 Mar 2005 11:27:51 +0200
Subject: [openib-general] [PATCH] uverbs: whitespace fix
Message-ID: <20050302092751.GB25029@mellanox.co.il>

Whitespace fix.

Signed-off-by: Michael S.
Tsirkin

Index: mthca_provider.c
===================================================================
--- mthca_provider.c (revision 1895)
+++ mthca_provider.c (working copy)
@@ -674,7 +674,7 @@ static struct ib_mr *mthca_reg_user_mr(s
         return ERR_PTR(-ENOMEM);

     list_for_each_entry(chunk, &region->chunk_list, list)
-        npages += chunk->nents;
+        npages += chunk->nents;

     page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL);
     if (!page_list) {

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Wed Mar 2 02:12:12 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 Mar 2005 12:12:12 +0200
Subject: [openib-general] [PATCH] uverbs_mem printk
Message-ID: <20050302101212.GC25029@mellanox.co.il>

Since userspace can trivially trigger get_user_pages failure by passing
in an illegal virtual address/size pair, I suggest removing the printk
when this happens: I think kernel messages should reflect kernel
problems, not user level application bugs.

Signed-off-by: Michael S. Tsirkin

Index: core/uverbs_mem.c
===================================================================
--- core/uverbs_mem.c (revision 1895)
+++ core/uverbs_mem.c (working copy)
@@ -69,11 +69,6 @@ int ib_umem_get(struct ib_device *dev, s
                         PAGE_SIZE / sizeof (struct page *)),
                     1, 0, page_list, NULL);

-        if (ret < 0) {
-            printk(KERN_ERR "get_user_pages: %d\n", ret);
-            printk(KERN_ERR "failed at cur_base %lx\n", cur_base);
-        }
-
         if (ret < 0)
             goto out;

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Wed Mar 2 02:43:11 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 Mar 2005 12:43:11 +0200
Subject: [openib-general] mr_table.max_mtt_order
Message-ID: <20050302104311.GD25029@mellanox.co.il>

Roland, two questions:

1. I'm looking at mthca_init_mr_table.
The following loop:

    for (i = 1, dev->mr_table.max_mtt_order = 0;
         i < dev->limits.num_mtt_segs;
         i <<= 1, ++dev->mr_table.max_mtt_order)
        ; /* nothing */

Seems to exit the first time when
(1 << (dev->mr_table.max_mtt_order) ) >= dev->limits.num_mtt_segs

So if dev->limits.num_mtt_segs is not a power of 2,
(1 << (dev->mr_table.max_mtt_order) ) > dev->limits.num_mtt_segs
and so max_mtt_order seems to be too large by 1?

Did I misunderstand something, or is there something that forces
dev->limits.num_mtt_segs to be a power of 2?

2. There are some places in mthca where we try to round some value up
to the power of 2, some done by loops like this one. I find them
error-prone. Will you accept a patch replacing them with an inline
function? Using fls, this function will also be more efficient than a
linear loop.

--
MST - Michael S. Tsirkin

From Arkady.Kanevsky at netapp.com Wed Mar 2 03:44:22 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 2 Mar 2005 06:44:22 -0500
Subject: [openib-general] IB Address Translation service
Message-ID:

Some historical perspective - ATS was defined prior to IPoIB.

The requirements.
DAT has two needs:
1. forward translation: given an IP address returns back IB GID/LID.
2. reverse translation: given IB GID/LID returns back an IP address of
the requestor.

ULPs: NFS, DAFS.

SDP encoded IP addresses into its headers.
But DAT is API and cannot define a protocol for it.

Abstract address translation is a good idea.
For IB we can use ATS or IPoIB.
For iWARP it will be no-op.
We must ensure that the DAPL that we submit to Linux can be
layered on top of all RDMA transports.

Since IPoIB had not had plugfest/connectathon or some other
interop that demonstrate ARP and RARP I suggest we have both
ATS and IPoIB support. ATS has been fully successfully tested
at DAPL Plugfest.

In DAPL we had not assessed the HA requirements implications
on address translations which is currently under discussion.

Arkady Kanevsky                email: arkady at netapp.com
Network Appliance              phone: 781-768-5395
375 Totten Pond Rd.            Fax: 781-895-1195
Waltham, MA 02451-2010         central phone: 781-768-5300

> -----Original Message-----
> From: Tom Duffy [mailto:tduffy at sun.com]
> Sent: Tuesday, March 01, 2005 6:02 PM
> To: Yaron Haviv
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] IB Address Translation service
>
>
> [ putting back on list ]
>
> On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote:
> > Did you try RARP with IPoIB ?
>
> I have not.
>
> > I thought that there is some issue that it doesn't work
>
> Currently, the rarpd only works with ethernet, but I don't
> see why this couldn't be fixed.
>
> > Also I hope you can comment on the other ib_at capabilities
> > which are more important than ATS
>
> I don't mind the idea of abstracting out address translation.
> I think maybe this is a premature optimization and we should
> see how each ULP uses/does it first, then abstract out common
> code. Otherwise, I feel neither strongly for or against your
> proposal.
>
> -tduffy
>

From gdror at mellanox.co.il Wed Mar 2 05:35:59 2005
From: gdror at mellanox.co.il (Dror Goldenberg)
Date: Wed, 2 Mar 2005 15:35:59 +0200
Subject: [dat-discussions] RE: [openib-general] IB Address Translation service
Message-ID: <506C3D7B14CDD411A52C00025558DED607211F55@mtlex01.yok.mtl.com>

From: Kanevsky, Arkady [mailto:arkady at netapp.com]
Sent: Wednesday, March 02, 2005 1:44 PM
>
> Some historical perspective - ATS was defined prior to IPoIB.
>
> The requirements.
> DAT has two needs:
> 1. forward translation: given an IP address returns back IB
> GID/LID. 2. reverse translation: given IB GID/LID returns
> back an IP address of the requestor.
>
> ULPs: NFS, DAFS.
>
> SDP encoded IP addresses into its headers.

Arkady, you meant that SDP placed the IP addresses into the private
data of the CM REQ message. This message just goes once, when the
connection is established. Right ?
In other words, if one wants to perform reverse lookup when not using
ATS, then the private data of the REQ message in DAPL has to change so
that the connecting node can send its IP address.

> But DAT is API and cannot define a protocol for it.
>
> Abstract address translation is a good idea.
> For IB we can use ATS or IPoIB.
> For iWARP it will be no-op.
> We must ensure that the DAPL that we submit to Linux can be
> layered on top of all RDMA transports.
>
> Since IPoIB had not had plugfest/connectathon or some other
> interop that demonstrate ARP and RARP I suggest we have both
> ATS and IPoIB support. ATS has been fully successfully tested
> at DAPL Plugfest.

As far as I know IPoIB was tested for interop to some degree at the
last plugfest. I don't know the details. Note that it was tested as a
standalone module and not as an address resolution mechanism for DAPL.

>
> In DAPL we had not assessed the HA requirements implications
> on address translations which is currently under discussion.
>
> Arkady Kanevsky                email: arkady at netapp.com
> Network Appliance              phone: 781-768-5395
> 375 Totten Pond Rd.            Fax: 781-895-1195
> Waltham, MA 02451-2010         central phone: 781-768-5300
>
>
>
> > -----Original Message-----
> > From: Tom Duffy [mailto:tduffy at sun.com]
> > Sent: Tuesday, March 01, 2005 6:02 PM
> > To: Yaron Haviv
> > Cc: openib-general at openib.org
> > Subject: RE: [openib-general] IB Address Translation service
> >
> >
> > [ putting back on list ]
> >
> > On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote:
> > > Did you try RARP with IPoIB ?
> >
> > I have not.
> >
> > > I thought that there is some issue that it doesn't work
> >
> > Currently, the rarpd only works with ethernet, but I don't
> > see why this couldn't be fixed.
> >
> > > Also I hope you can comment on the other ib_at capabilities
> > > which are more important than ATS
> >
> > I don't mind the idea of abstracting out address translation.
> > I think maybe this is a premature optimization and we should
> > see how each ULP uses/does it first, then abstract out common
> > code. Otherwise, I feel neither strongly for or against your
> > proposal.
> >
> > -tduffy
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at mellanox.co.il Wed Mar 2 07:14:46 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 Mar 2005 17:14:46 +0200
Subject: [openib-general] Fwd: Linux 2.6.11
Message-ID: <20050302151446.GA26194@mellanox.co.il>

To little fanfare, the first mainline kernel with InfiniBand support has
been released:

# ls linux-2.6.10/drivers/infiniband
/bin/ls: linux-2.6.10/drivers/infiniband: No such file or directory
# ls linux-2.6.11/drivers/infiniband
. .. Kconfig Makefile core hw include ulp

And it's _all_ officially bug free (see attachment), which must mean
gen2 code is officially bug free too!

--
MST - Michael S. Tsirkin
-------------- next part --------------
An embedded message was scrubbed...
From: Linus Torvalds
Subject: Linux 2.6.11
Date: no date
Size: 5140
URL:

From jlentini at netapp.com Wed Mar 2 08:11:35 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 2 Mar 2005 11:11:35 -0500 (EST)
Subject: [openib-general] putting in dead wood for DAPL and similar abomination
In-Reply-To: <20050301220543.GA16443@lst.de>
References: <20050301220543.GA16443@lst.de>
Message-ID:

hch> Please don't put in things like the address translation service or
hch> memory windows for DAPL folks.
The IB code in the kernel already
hch> has far too much unused stuff and adding more will not go past reviews
hch> for kernel inclusions - as will DAPL itself exactly because of such
hch> utter stupidities. Similar hint to the NFS over RDMA folks at CITI -
hch> if you want your stuff to go in use the openib helper directly below
hch> the transport switch - different RDMA transports are too diverse to
hch> be sanely abstracted out and DAPL does a horrible job at that.

DAPL has been efficiently supported on top of InfiniBand, iWARP, the
Virtual Interface Architecture, Quadrics, and Myrinet.

hch> If we need to consolidate code for different transports we can put
hch> it into a library later on.

From jlentini at netapp.com Wed Mar 2 08:26:49 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 2 Mar 2005 11:26:49 -0500 (EST)
Subject: [openib-general] IB Address Translation service
In-Reply-To: <1109715208.11800.41.camel@duffman>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman>
Message-ID:

tduffy> > The one thing that ATS provide and is not possible with
tduffy> > ARP is reverse resolution GID->IP, any ideas how to achieve
tduffy> > that without ATS ?
tduffy>
tduffy> RARP.

Where is the encapsulation of RARP packets on IB defined?

The "Transmission of IP over InfiniBand" IETF draft specifies the
procedure for ARP and Neighbor Discovery, but not RARP.

From tduffy at sun.com Wed Mar 2 10:11:34 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 02 Mar 2005 10:11:34 -0800
Subject: [openib-general] [PATCH][SDP] Make sdp compile on 2.6.11
Message-ID: <1109787094.4913.7.camel@duffman>

Now that 2.6.11 is out, need to make sdp compile with 2.6.11.

Signed-off-by: Tom Duffy

Index: drivers/infiniband/ulp/sdp/sdp_iocb.c
===================================================================
--- drivers/infiniband/ulp/sdp/sdp_iocb.c (revision 1937)
+++ drivers/infiniband/ulp/sdp/sdp_iocb.c (working copy)
@@ -141,6 +141,7 @@ static int sdp_iocb_page_save(struct sdp
     struct page *page;
     unsigned long pfn;
     pgd_t *pgd;
+    pud_t *pud;
     pmd_t *pmd;
     pte_t *ptep;
     pte_t pte;
@@ -182,8 +183,12 @@ static int sdp_iocb_page_save(struct sdp
         pgd = pgd_offset_gate(iocb->mm, addr);
         if (!pgd || pgd_none(*pgd))
             break;
+
+        pud = pud_offset(pgd, addr);
+        if (!pud || pud_none(*pud))
+            break;

-        pmd = pmd_offset(pgd, addr);
+        pmd = pmd_offset(pud, addr);
         if (!pmd || pmd_none(*pmd))
             break;

From yaronh at voltaire.com Wed Mar 2 10:15:52 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Wed, 2 Mar 2005 20:15:52 +0200
Subject: [openib-general] IB Address Translation service
Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAD96@taurus.voltaire.com>

> -----Original Message-----
> From: Tom Duffy [mailto:tduffy at sun.com]
> Sent: Wednesday, March 02, 2005 1:02 AM
> To: Yaron Haviv
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] IB Address Translation service
>
> [ putting back on list ]
>
> On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote:
> > Did you try RARP with IPoIB ?
>
> I have not.
>
> > I thought that there is some issue that it doesn't work
>
> Currently, the rarpd only works with ethernet, but I don't see why this
> couldn't be fixed.
>

Tom, IPoIB HW Address consists of GID+QPN+..
In order to issue a RARP I believe you should supply the full HW address
to get the IP address back, how would you know the remote IPoIB QPN ? or
can you do it without a QPN ?

Yaron

From Thomas.Talpey at netapp.com Wed Mar 2 10:40:32 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Wed, 02 Mar 2005 13:40:32 -0500
Subject: [openib-general] putting in dead wood for DAPL and similar abomination
In-Reply-To: <20050301220543.GA16443@lst.de>
References: <20050301220543.GA16443@lst.de>
Message-ID: <6.2.1.2.2.20050301214342.01403700@exnane01.nane.netapp.com>

At 05:05 PM 3/1/2005, Christoph Hellwig wrote:
>Similar hint to the NFS over RDMA folks at CITI -
>if you want your stuff to go in use the openib helper directly below
>the transport switch - different RDMA transports are too diverse to
>be sanely abstracted out and DAPL does a horrible job at that. If
>we need to consolidate code for different transports we can put it
>into a library later on.

Ok, I'll speak for the NFS over RDMA implementation. (I've brought
your hint to the attention of the CITI folks - we are working together
this week here at the NFS Connectathon).

The NFS/RDMA client, and soon the server, use kDAPL for a simple
reason - we need an RDMA API which allows us to plug in RDMA
NICs without also having to modify NFS client, server and RPC code.

You're trying to sentence us to coding NFS to individual hardware.
It's unacceptable to have to modify NFS and RPC just because a
new adapter has been attached. It's the same NFS/RDMA protocol
over IB, iWARP, and even VI.

Offering "consolidation" "later on" is an enormous step backward from
what we're using (successfully) today.

Tom.

From Thomas.Talpey at netapp.com Wed Mar 2 10:49:00 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Wed, 02 Mar 2005 13:49:00 -0500
Subject: [openib-general] IB Address Translation service
In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD96@taurus.voltaire.com >
References: <35EA21F54A45CB47B879F21A91F4862F3FAD96@taurus.voltaire.com>
Message-ID: <6.2.1.2.2.20050302134205.05a92eb0@exnane01.nane.netapp.com>

At 01:15 PM 3/2/2005, Yaron Haviv wrote:
>In order to issue a RARP I believe you should supply the full HW address
>to get the IP address back, how would you know the remote IPoIB QPN ? or
>can you do it without a QPN ?

To say nothing of the fact that there must be a RARPD, administered and
secured on each subnet. Aren't there enough daemons needed to support
this stuff as it is?

The advantage of ATS is that it "just works" whether wired point to
point, or via a switch, or whatever. It requires no central
administration, works as transparently as ARP and ND, and supports IP
addressing so applications don't have any ambiguity in how they resolve
names.

If we get rid of ATS, what do we replace it with? Raw IB GID's from the
application??

Tom.

From robert.j.woodruff at intel.com Wed Mar 2 11:22:56 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 2 Mar 2005 11:22:56 -0800
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>

James wrote:
>DAPL has been efficiently supported on top of InfiniBand, iWARP, the
>Virtual Interface Architecture, Quadrics, and Myrinet.

I think the point is that only one of those interconnects (IB) is in the
kernel, the rest are proprietary. Do any of the other RDMA interconnect
vendors plan to submit their code for inclusion into Linux in the near
future ?

woody

From halr at voltaire.com Wed Mar 2 11:32:21 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Mar 2005 14:32:21 -0500
Subject: [openib-general] [PATCH][CORE] fix sparse warnings about static variables
In-Reply-To: <1109700755.11800.21.camel@duffman>
References: <1109700755.11800.21.camel@duffman>
Message-ID: <1109791941.4645.17.camel@localhost.localdomain>

On Tue, 2005-03-01 at 13:12, Tom Duffy wrote:
> This gets rid of the new sparse warnings like:
>
> /build1/tduffy/openib-work/linux-2.6.10-openib/drivers/infiniband/core/mad.c:50:14: warning: symbol 'ib_mad_cache' was not declared. Should it be static?

Thanks. Applied.

-- Hal

From sean.hefty at intel.com Wed Mar 2 12:14:28 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 2 Mar 2005 12:14:28 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <4224B419.1080601@ichips.intel.com>
Message-ID:

>I'm studying the RMPP implementation requirements for reassembly, and
>there are a couple of issues/questions.

A couple more comments while coding up the reassembly:

struct ib_mad_recv_buf contains struct ib_mad *mad. I'm wondering if it
makes sense to change this to a union of ib_mad *, ib_rmpp_mad *,
ib_vendor_mad *, ib_smp *, ib_sa_mad *. Currently, the user casts the
returned MAD to the correct format. This would be a minor, but visible
change to all current MAD users...

Has anyone given thought on how to best expose RMPP to user mode?

- Sean

From halr at voltaire.com Wed Mar 2 12:17:06 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Mar 2005 15:17:06 -0500
Subject: [Fwd: [openib-general] [RFC] Diagnostic tree structure]
Message-ID: <1109794536.4645.22.camel@localhost.localdomain>

The diagnostics tree structure has been flattened one level. There are
no longer host and net subdirectories. The diag tools have been moved up
one level in the tree.

Let me know if there are any problems I introduced by doing this.

Thanks.

-- Hal

-----Forwarded Message-----
From: Hal Rosenstock
To: openib-general at openib.org
Subject: [openib-general] [RFC] Diagnostic tree structure
Date: 17 Feb 2005 09:06:39 -0500

Hi,

The current userspace diagnostics tree structure has host and net
subdirectories. The distinction between the two is blurring so we would
like to flatten the tree and just have all the tools under diags.

Any objections ?

Thanks.

-- Hal

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From krause at cup.hp.com Wed Mar 2 12:24:47 2005
From: krause at cup.hp.com (Michael Krause)
Date: Wed, 02 Mar 2005 12:24:47 -0800
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>
Message-ID: <6.2.0.14.2.20050302122242.02994f68@esmail.cup.hp.com>

At 11:22 AM 3/2/2005, Woodruff, Robert J wrote:
> James wrote:,
> >DAPL has been efficiently supported on top of InfiniBand, iWARP, the
> >Virtual Interface Architecture, Quadrics, and Myrinet.
>
>I think the point is that only one of those interconnects (IB) is
>in the kernel, the rest are proprietary. Do any of the other RDMA
>interconnect vendors plan to submit their code for inclusion into Linux
>in the near future ?

I know some people are working on iWARP devices, though this could be
done on the OpenRDMA source work (still a work in progress), which
supports both iWARP and IB.

BTW, I second the push by Tom et al. to use an API to abstract this and
avoid having to permute every subsystem to work for a given device. The
RNIC PI is intended to provide abstraction for iWARP / IB hardware to a
very large extent (think of this as a standard verbs interface).
IT API / DAPL provide another layer of abstraction and can be used to
integrate subsystems either over the RNIC PI or whatever verbs API
people desire.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hch at lst.de Wed Mar 2 13:48:47 2005
From: hch at lst.de (Christoph Hellwig)
Date: Wed, 2 Mar 2005 22:48:47 +0100
Subject: [openib-general] putting in dead wood for DAPL and similar abomination
In-Reply-To:
References: <20050301220543.GA16443@lst.de>
Message-ID: <20050302214847.GA4253@lst.de>

On Wed, Mar 02, 2005 at 11:11:35AM -0500, James Lentini wrote:
> DAPL has been efficiently supported on top of InfiniBand, iWARP, the
> Virtual Interface Architecture, Quadrics, and Myrinet.

And I've not seen any kernel submission for either of them - and what's
important no single kDAPL application that actually shows any benefit
that way. Voltaire's iSER implementation would surely be smaller when
directly written to the OpenIB interface, and is already smaller than
the whole kDAPL layer.

From yaronh at voltaire.com Wed Mar 2 14:26:33 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Thu, 3 Mar 2005 00:26:33 +0200
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, March 02, 2005 11:49 PM
> To: James Lentini
> Cc: Christoph Hellwig; openib-general at openib.org
> Subject: Re: [openib-general] putting in dead wood for DAPL and
> similarabomination
>
> On Wed, Mar 02, 2005 at 11:11:35AM -0500, James Lentini wrote:
> > DAPL has been efficiently supported on top of InfiniBand, iWARP, the
> > Virtual Interface Architecture, Quadrics, and Myrinet.
>
> And I've not seen any kernel submission for either of them - and what's
> important no single kDAPL application that actually shows any benefit
> that way. Voltaire's iSER implementation would surely be smaller when
> directly written to the OpenIB interface, and is already smaller than
> the whole kDAPL layer.

Christoph, the reason the iSER code is very thin is that it is using
kDAPL (and Linux iSCSI), it doesn't need to deal with SA calls, CM
calls, LIDs, GIDs, and a bunch of other things.

Besides being RDMA transport independent, DAPL enables people to code to
RDMA without being intimately familiar with the HW. We saw people coding
to it in days, which I can't say for Verbs.

Abstraction layers are not new to Linux. Sockets is another type of
abstraction with multiple protocols/families underneath, or even
Ethernet. Why aren't you suggesting to do a TCP implementation for ATM
cards, and one for PPP, etc'

Yaron

From jon-openib at umich.edu Wed Mar 2 14:32:22 2005
From: jon-openib at umich.edu (Jon Bauman)
Date: Wed, 2 Mar 2005 14:32:22 -0800
Subject: [openib-general] putting in dead wood for DAPL and similar abomination
Message-ID: <92a7dc786ef0764d64c329a62c42a3aa@umich.edu>

At 05:05 PM 3/1/2005, Christoph Hellwig wrote:
> Similar hint to the NFS over RDMA folks at CITI -
> if you want your stuff to go in use the openib helper directly below
> the transport switch - different RDMA transports are too diverse to
> be sanely abstracted out and DAPL does a horrible job at that. If
> we need to consolidate code for different transports we can put it
> into a library later on.

CITI folk here.
I'm not familiar with the openib helper you refer to, but since you
mention the transport switch, I'll assume you're referring to the
client. I'm currently working on the NFS over RPCRDMA server, so this
isn't of much help to me.

While I'd agree that DAPL has its shortcomings, it's not finalized yet,
and I know of no other alternatives. On the other hand, I don't agree
that the different RDMA transports are necessarily too diverse to
provide a reasonable API for them. It seems silly to invest a lot of
effort writing directly for IB, since we couldn't reuse the code for
other transports.

Why create a nonstandard library after the fact when so much work has
gone into DAPL already? Even if DAPL needs to change, we can later make
our changes just once at that layer.

We should have basic functionality with NFS atop DAPL in the near future
that will enable us to plug in different transports without changing the
ULP code. Would that convince you that DAPL is at least a useful
starting point?

From tduffy at sun.com Wed Mar 2 15:39:48 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 02 Mar 2005 15:39:48 -0800
Subject: [openib-general] "arping" failing over ipoib
Message-ID: <1109806789.4913.43.camel@duffman>

I am trying to configure my fedora box to bring up ib0 on startup.
Unfortunately it is failing, saying that the IP address is already taken
on the network -- no matter what IP address I use. I traced this down
to the fact that the ifup script uses "arping" to test this condition.
It appears arping is failing something like this:

# arping -c 2 -w 3 -D -I ib0 192.168.0.62
ARPING 192.168.0.62 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received -1 response(s)

where I can ping:

# ping 192.168.0.62
PING 192.168.0.62 (192.168.0.62) 56(84) bytes of data.
64 bytes from 192.168.0.62: icmp_seq=0 ttl=64 time=0.079 ms

--- 192.168.0.62 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.079/0.079/0.079/0.000 ms, pipe 2

[root at flopteron iputils]# arp -a
? (192.168.0.62) at 00:00:00:14:FE:80:00:00:00 [infiniband] on ib0
? (10.6.98.1) at 00:00:0C:07:AC:00 [ether] on eth0

# ifconfig ib0
ib0       Link encap:InfiniBand  HWaddr 00:00:00:84:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.0.25  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c901:a99:e0a1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:39 errors:0 dropped:0 overruns:0 frame:0
          TX packets:39 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:3164 (3.0 KiB)  TX bytes:3320 (3.2 KiB)

I have looked at the arping code, and it seems to be crafting the packet
correctly, even using 32 in the type field, so I am a bit befuddled as
to why this isn't working.

Any ideas?

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From halr at voltaire.com Wed Mar 2 16:31:42 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Mar 2005 19:31:42 -0500
Subject: [openib-general] "arping" failing over ipoib
In-Reply-To: <1109806789.4913.43.camel@duffman>
References: <1109806789.4913.43.camel@duffman>
Message-ID: <1109809902.4645.48.camel@localhost.localdomain>

On Wed, 2005-03-02 at 18:39, Tom Duffy wrote:
> I am trying to configure my fedora box to bring up ib0 on startup.
> Unfortunately it is failing, saying that the IP address is already taken
> on the network -- no matter what IP address I use. I traced this down
> to the fact that the ifup script uses "arping" to test this condition.
> It appears arping is failing something like this:
>
> # arping -c 2 -w 3 -D -I ib0 192.168.0.62
> ARPING 192.168.0.62 from 0.0.0.0 ib0
> Sent 2 probes (2 broadcast(s))
> Received -1 response(s)

Here's what I see going on:

arping appears to cause a join (with component mask 0x10083) to MGID
0xFFFF:FFFF:FFFF:0742:0070:26C2:C43F:81C0. That does not look like an
IPoIB MGID to me. Not sure how this MGID is generated. The SM refuses
this with status 0x0600 (ERR_REQ_INSUFFICIENT_COMPONENTS), which is what
a join request gets when the group is not already created.

-- Hal

From hch at lst.de Wed Mar 2 19:48:27 2005
From: hch at lst.de (Christoph Hellwig)
Date: Thu, 3 Mar 2005 04:48:27 +0100
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com>
Message-ID: <20050303034827.GA9092@lst.de>

On Thu, Mar 03, 2005 at 12:26:33AM +0200, Yaron Haviv wrote:
> > And I've not seen any kernel submission for either of them - and what's
> > important no single kDAPL application that actually shows any benefit
> > that way. Voltaire's iSER implementation would surely be smaller when
> > directly written to the OpenIB interface, and is already smaller than
> > the whole kDAPL layer.
>
> Christoph, the reason the iSER code is very thin is that it is using kDAPL
> (and Linux iSCSI), it doesn't need to deal with SA calls, CM calls,
> LIDs, GIDs, and a bunch of other things.

Umm, no - it's not small. In its current form it's freakin' huge.
That's partially a fault of stupid kDAPL APIs, the cisco iscsi code and
the broken implementation of the iscsi transport switch, but also
because it's pretty bad code.

The current iSER code is 10928 LOC, add to that 22155 LOC of kDAPL (not
including the actual provider for IB) and 5822 LOC linux-iscsi kernel
code.
Compare that to the 25412 LOC total for drivers/infiniband in Linux 2.6.11.

Here's the challenge: if someone gets me the funding, I'll write a complete
iSER over IB implementation in less than 10k LOC based on the open-iscsi
code.

> Besides being RDMA transport independent DAPL enable people to code to
> RDMA without been intimately familiar with the HW, we saw people coding
> to it in days, Which I can't say the same for Verbs.

Which means they'll hack up total crap code.

> Abstract layers are not new to Linux, Sockets is another type of
> abstraction with multiple protocols/families underneath,

And you forgot that sockets are a really small abstraction layer, which
kDAPL is not.  And even though sockets provide a really nice abstraction
for the data transmission, you need to know about address families for
connection establishment and control.  Really bad analogy, you lost :)

From roland at topspin.com Wed Mar 2 20:32:42 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:32:42 -0800
Subject: [openib-general] [PATCH] uverbs: whitespace fix
In-Reply-To: <20050302092751.GB25029@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 11:27:51 +0200")
References: <20050302092751.GB25029@mellanox.co.il>
Message-ID: <52ekexs61h.fsf@topspin.com>

Thanks, applied.

From roland at topspin.com Wed Mar 2 20:34:04 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:34:04 -0800
Subject: [openib-general] [PATCH] uverbs_mem printk
In-Reply-To: <20050302101212.GC25029@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 12:12:12 +0200")
References: <20050302101212.GC25029@mellanox.co.il>
Message-ID: <52acpls5z7.fsf@topspin.com>

Thanks, applied.  That was really just some left over debugging code
anyway.
From roland at topspin.com Wed Mar 2 20:37:13 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:37:13 -0800
Subject: [openib-general] mr_table.max_mtt_order
In-Reply-To: <20050302104311.GD25029@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 12:43:11 +0200")
References: <20050302104311.GD25029@mellanox.co.il>
Message-ID: <526509s5ty.fsf@topspin.com>

    Michael> Did I misunderstand something, or is there something that
    Michael> forces dev->limits.num_mtt_segs to be a power of 2?

Well, right now it's essentially hard coded to 1<<20, so it's OK for
now.  In general the buddy allocator used for allocating MTT entries
will break if

    Michael> 2. There are some places in mthca where we try to round
    Michael> some value up to the power of 2, some done by loops like
    Michael> this one.  I find them error-prone.  Will you accept a
    Michael> patch replacing them with an inline function?  Using fls,
    Michael> this function will also be more efficient than a linear
    Michael> loop.

I thought about this a little.  I think that any inline function
forces someone reading the code to look up what it does, no matter how
descriptive the name we come up with.  I think it would be better to
use fls() directly.  I'm already in the habit of using ffs() to
compute log_2 of powers of two but for some reason I never remember
fls().

 - R.

From roland at topspin.com Wed Mar 2 20:40:11 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:40:11 -0800
Subject: [openib-general] "arping" failing over ipoib
In-Reply-To: <1109806789.4913.43.camel@duffman> (Tom Duffy's message of "Wed, 02 Mar 2005 15:39:48 -0800")
References: <1109806789.4913.43.camel@duffman>
Message-ID: <521xaxs5p0.fsf@topspin.com>

    Tom> # arping -c 2 -w 3 -D -I ib0 192.168.0.62
    Tom> ARPING 192.168.0.62 from 0.0.0.0 ib0
    Tom> Sent 2 probes (2 broadcast(s))
    Tom> Received -1 response(s)

What are the network startup scripts expecting?
How is arping getting so confused that it reports -1 responses?
(Sorry, haven't had a chance to look at the code).

    Tom> I have looked at the arping code, and it seems to be crafting
    Tom> the packet correctly, even using 32 in the type field, so I
    Tom> am a bit befuddled as to why this isn't working.

What packet does arping create?  Unfortunately, because a "normal"
IPoIB packet doesn't include any encapsulation beyond the 4 bytes of
ethertype/reserved, it's a little difficult for userspace to send
broadcast packets.  If arping tries to create an ethernet-like header,
then the IPoIB driver is going to get a little confused.

 - R.

From tduffy at sun.com Wed Mar 2 21:24:08 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 02 Mar 2005 21:24:08 -0800
Subject: [openib-general] "arping" failing over ipoib
In-Reply-To: <521xaxs5p0.fsf@topspin.com>
References: <1109806789.4913.43.camel@duffman> <521xaxs5p0.fsf@topspin.com>
Message-ID: <1109827448.26501.3.camel@duffman>

On Wed, 2005-03-02 at 20:40 -0800, Roland Dreier wrote:
> Tom> # arping -c 2 -w 3 -D -I ib0 192.168.0.62
> Tom> ARPING 192.168.0.62 from 0.0.0.0 ib0
> Tom> Sent 2 probes (2 broadcast(s))
> Tom> Received -1 response(s)
>
> What are the network startup scripts expecting?  How is arping getting
> so confused that it reports -1 responses?  (Sorry, haven't had a
> chance to look at the code).

That is a good question.  I think the code is b0rked.  -1 is coming from
the "received" variable.  This is never initialized.  Initializing it to
0 at least causes arping to fail gracefully, letting ifup continue.  I
have opened fedora bug 150156 regarding this and emailed the arping
maintainer.

> Tom> I have looked at the arping code, and it seems to be crafting
> Tom> the packet correctly, even using 32 in the type field, so I
> Tom> am a bit befuddled as to why this isn't working.
>
> What packet does arping create?
> Unfortunately, because a "normal"
> IPoIB packet doesn't include any encapsulation beyond the 4 bytes of
> ethertype/reserved, it's a little difficult for userspace to send
> broadcast packets.  If arping tries to create an ethernet-like header,
> then the IPoIB driver is going to get a little confused.

OK, well I thought it was right according to the ARP IPoIB encapsulation
IETF draft.  I will look at it a bit more in depth tomorrow.

In any event, what is the right format of the packet for userspace to
craft?

Thanks,

-tduffy

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][2/11] IB: fix vendor MAD deregistration
In-Reply-To: <2005322131.pkxanHLh4SQ8X31k@topspin.com>
Message-ID: <2005322131.5pgryiWlkZPYdcE7@topspin.com>

From: Shahar Frank

Fix bug when deregistering a vendor class MAD agent.
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-02 20:26:03.185796628 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-03-02 20:26:10.980104746 -0800 @@ -41,7 +41,6 @@ #include "smi.h" #include "agent.h" - MODULE_LICENSE("Dual BSD/GPL"); MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); @@ -490,6 +489,7 @@ cancel_mads(mad_agent_priv); port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); flush_workqueue(port_priv->wq); @@ -1266,12 +1266,12 @@ } port_priv = agent_priv->qp_info->port_priv; + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); class = port_priv->version[ agent_priv->reg_req->mgmt_class_version].class; if (!class) goto vendor_check; - mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); method = class->method_table[mgmt_class]; if (method) { /* Remove any methods for this mad agent */ @@ -1293,16 +1293,21 @@ } vendor_check: + if (!is_vendor_class(mgmt_class)) + goto out; + + /* normalize mgmt_class to vendor range 2 */ + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); vendor = port_priv->version[ agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) goto out; - mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); vendor_class = vendor->vendor_class[mgmt_class]; if (vendor_class) { index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); - if (index == -1) + if (index < 0) goto out; method = vendor_class->method_table[index]; if (method) { --- linux-export.orig/drivers/infiniband/core/mad_priv.h 2005-03-02 20:26:03.185796628 -0800 +++ linux-export/drivers/infiniband/core/mad_priv.h 2005-03-02 20:26:10.980104746 -0800 @@ -58,8 +58,8 @@ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 #define MAX_MGMT_OUI 8 -#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ - IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 +#define MAX_MGMT_VENDOR_RANGE2 
(IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1) struct ib_mad_list_head { struct list_head list; From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][0/11] InfiniBand fixes Message-ID: <2005322131.J5dPz9nJYwSlaHs6@topspin.com> Here is a batch of fixes from the OpenIB subversion tree for merging. Thanks, Roland From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][3/11] IB: sparse fixes In-Reply-To: <2005322131.5pgryiWlkZPYdcE7@topspin.com> Message-ID: <2005322131.O2Ym8iporsXeypcV@topspin.com> From: Tom Duffy Fix some sparse warnings by making sure we have appropriate "extern" declarations visible. Signed-off-by: Tom Duffy Signed-off-by: Hal Rosenstock ( Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/agent.c 2005-03-02 20:26:10.599187430 -0800 +++ linux-export/drivers/infiniband/core/agent.c 2005-03-02 20:26:11.456001445 -0800 @@ -45,14 +45,11 @@ #include "smi.h" #include "agent_priv.h" #include "mad_priv.h" - +#include "agent.h" spinlock_t ib_agent_port_list_lock; static LIST_HEAD(ib_agent_port_list); -extern kmem_cache_t *ib_mad_cache; - - /* * Caller must hold ib_agent_port_list_lock */ --- linux-export.orig/drivers/infiniband/core/cache.c 2005-03-02 20:26:03.085818330 -0800 +++ linux-export/drivers/infiniband/core/cache.c 2005-03-02 20:26:11.456001445 -0800 @@ -37,6 +37,8 @@ #include #include +#include + #include "core_priv.h" struct ib_pkey_cache { --- linux-export.orig/drivers/infiniband/core/mad_priv.h 2005-03-02 20:26:10.980104746 -0800 +++ linux-export/drivers/infiniband/core/mad_priv.h 2005-03-02 20:26:11.457001228 -0800 @@ -192,4 +192,6 @@ struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; +extern kmem_cache_t *ib_mad_cache; + #endif /* __IB_MAD_PRIV_H__ */ --- 
linux-export.orig/drivers/infiniband/core/smi.c 2005-03-02 20:26:03.085818330 -0800 +++ linux-export/drivers/infiniband/core/smi.c 2005-03-02 20:26:11.458001011 -0800 @@ -37,7 +37,7 @@ */ #include - +#include "smi.h" /* * Fixup a directed route SMP for sending From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][1/11] IB: simplify MAD code In-Reply-To: <2005322131.J5dPz9nJYwSlaHs6@topspin.com> Message-ID: <2005322131.pkxanHLh4SQ8X31k@topspin.com> From: Hal Rosenstock Remove unneeded MAD agent registration by using a single agent for both directed-route and LID-routed MADs. Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/agent.c 2005-03-02 20:26:03.280776011 -0800 +++ linux-export/drivers/infiniband/core/agent.c 2005-03-02 20:26:10.599187430 -0800 @@ -66,14 +66,13 @@ if (device) { list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if (entry->dr_smp_agent->device == device && + if (entry->smp_agent->device == device && entry->port_num == port_num) return entry; } } else { list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if ((entry->dr_smp_agent == mad_agent) || - (entry->lr_smp_agent == mad_agent) || + if ((entry->smp_agent == mad_agent) || (entry->perf_mgmt_agent == mad_agent)) return entry; } @@ -111,7 +110,7 @@ return 1; } - return smi_check_local_smp(port_priv->dr_smp_agent, smp); + return smi_check_local_smp(port_priv->smp_agent, smp); } static int agent_mad_send(struct ib_mad_agent *mad_agent, @@ -231,10 +230,8 @@ /* Get mad agent based on mgmt_class in MAD */ switch (mad->mad.mad.mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - mad_agent = port_priv->dr_smp_agent; - break; case IB_MGMT_CLASS_SUBN_LID_ROUTED: - mad_agent = port_priv->lr_smp_agent; + mad_agent = port_priv->smp_agent; break; case IB_MGMT_CLASS_PERF_MGMT: mad_agent = 
port_priv->perf_mgmt_agent; @@ -284,7 +281,6 @@ { int ret; struct ib_agent_port_private *port_priv; - struct ib_mad_reg_req reg_req; unsigned long flags; /* First, check if port already open for SMI */ @@ -308,35 +304,19 @@ spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->send_posted_list); - /* Obtain MAD agent for directed route SM class */ - reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; - reg_req.mgmt_class_version = 1; - - port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, - IB_QPT_SMI, - NULL, 0, - &agent_send_handler, - NULL, NULL); + /* Obtain send only MAD agent for SM class (SMI QP) */ + port_priv->smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); - if (IS_ERR(port_priv->dr_smp_agent)) { - ret = PTR_ERR(port_priv->dr_smp_agent); + if (IS_ERR(port_priv->smp_agent)) { + ret = PTR_ERR(port_priv->smp_agent); goto error2; } - /* Obtain MAD agent for LID routed SM class */ - reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, - IB_QPT_SMI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->lr_smp_agent)) { - ret = PTR_ERR(port_priv->lr_smp_agent); - goto error3; - } - - /* Obtain MAD agent for PerfMgmt class */ - reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + /* Obtain send only MAD agent for PerfMgmt class (GSI QP) */ port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, NULL, 0, @@ -344,15 +324,15 @@ NULL, NULL); if (IS_ERR(port_priv->perf_mgmt_agent)) { ret = PTR_ERR(port_priv->perf_mgmt_agent); - goto error4; + goto error3; } - port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + port_priv->mr = ib_get_dma_mr(port_priv->smp_agent->qp->pd, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(port_priv->mr)) { printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); ret = PTR_ERR(port_priv->mr); - goto error5; + goto error4; } 
spin_lock_irqsave(&ib_agent_port_list_lock, flags); @@ -361,12 +341,10 @@ return 0; -error5: - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); error4: - ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); error3: - ib_unregister_mad_agent(port_priv->dr_smp_agent); + ib_unregister_mad_agent(port_priv->smp_agent); error2: kfree(port_priv); error1: @@ -391,8 +369,7 @@ ib_dereg_mr(port_priv->mr); ib_unregister_mad_agent(port_priv->perf_mgmt_agent); - ib_unregister_mad_agent(port_priv->lr_smp_agent); - ib_unregister_mad_agent(port_priv->dr_smp_agent); + ib_unregister_mad_agent(port_priv->smp_agent); kfree(port_priv); return 0; --- linux-export.orig/drivers/infiniband/core/agent_priv.h 2005-03-02 20:26:03.280776011 -0800 +++ linux-export/drivers/infiniband/core/agent_priv.h 2005-03-02 20:26:10.599187430 -0800 @@ -55,8 +55,7 @@ struct list_head send_posted_list; spinlock_t send_list_lock; int port_num; - struct ib_mad_agent *dr_smp_agent; /* DR SM class */ - struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *smp_agent; /* SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ struct ib_mr *mr; }; From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][4/11] IB/mthca: add missing break In-Reply-To: <2005322131.O2Ym8iporsXeypcV@topspin.com> Message-ID: <2005322131.oecVhU1CS3swCooO@topspin.com> Add missing break statements in switch in mthca_profile.c (pointed out by Michael Tsirkin). 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-02 20:26:03.023831785 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-02 20:26:11.904904003 -0800 @@ -241,10 +241,12 @@ case MTHCA_RES_UDAV: dev->av_table.ddr_av_base = profile[i].start; dev->av_table.num_ddr_avs = profile[i].num; + break; case MTHCA_RES_UARC: init_hca->uarc_base = profile[i].start; init_hca->log_uarc_sz = ffs(request->uarc_size) - 13; init_hca->log_uar_sz = ffs(request->num_uar) - 1; + break; default: break; } From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][5/11] IB/mthca: fix reset value endianness In-Reply-To: <2005322131.oecVhU1CS3swCooO@topspin.com> Message-ID: <2005322131.ube7cIPz9y7840bB@topspin.com> MTHCA_RESET_VALUE must always be swapped, since the HCA expects to see it in big-endian order and we write it with writel. This means on little-endian systems we have to swap it to big-endian order before writing, and on big-endian systems we need to swap it to make up for the additional swap that writel will do. This fixes resetting the HCA on big-endian machines. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_reset.c 2005-03-02 20:26:02.970843287 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_reset.c 2005-03-02 20:26:12.219835642 -0800 @@ -50,7 +50,7 @@ struct pci_dev *bridge = NULL; #define MTHCA_RESET_OFFSET 0xf0010 -#define MTHCA_RESET_VALUE cpu_to_be32(1) +#define MTHCA_RESET_VALUE swab32(1) /* * Reset the chip. 
This is somewhat ugly because we have to From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][6/11] IB/ipoib: fix rx memory leak In-Reply-To: <2005322131.ube7cIPz9y7840bB@topspin.com> Message-ID: <2005322131.6N8qBqgz1WuD4wnL@topspin.com> Fix memory leak when posting a receive buffer (pointed out by Shirley Ma). Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-02 20:26:02.919854355 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-02 20:26:12.514771621 -0800 @@ -137,6 +137,9 @@ if (ret) { ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", id, ret); + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + dev_kfree_skb_any(skb); priv->rx_ring[id].skb = NULL; } From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][7/11] IB/ipoib: use list_for_each_entry_safe when required In-Reply-To: <2005322131.6N8qBqgz1WuD4wnL@topspin.com> Message-ID: <2005322131.K2SnvQsocHnkTwPm@topspin.com> From: Shirley Ma Change uses of list_for_each_entry() where the loop variable is freed inside the loop to list_for_each_entry_safe(). 
Signed-off-by: Shirley Ma Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-03-02 20:26:02.832873236 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-03-02 20:26:12.799709771 -0800 @@ -790,7 +790,7 @@ spin_unlock_irqrestore(&priv->lock, flags); - list_for_each_entry(mcast, &remove_list, list) { + list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { ipoib_mcast_leave(dev, mcast); ipoib_mcast_free(mcast); } @@ -902,7 +902,7 @@ spin_unlock_irqrestore(&priv->lock, flags); /* We have to cancel outside of the spinlock */ - list_for_each_entry(mcast, &remove_list, list) { + list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][8/11] IB/ipoib: rename global symbols In-Reply-To: <2005322131.K2SnvQsocHnkTwPm@topspin.com> Message-ID: <2005322131.OKEJHXn13XfMX2Aa@topspin.com> Make IPoIB data_debug_level module parameter static to the single file where it is used. Also Rename IPoIB module parameter variable from "debug_level" to "ipoib_debug_level". This avoids possible name clashes if IPoIB is built into the kernel. We use module_param_named so that the user-visible parameter names remain the same. Signed-off-by: Tom Duffy Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-03-02 20:26:02.744892334 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib.h 2005-03-02 20:26:13.207621227 -0800 @@ -308,11 +308,11 @@ #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -extern int debug_level; +extern int ipoib_debug_level; #define ipoib_dbg(priv, format, arg...) \ do { \ - if (debug_level > 0) \ + if (ipoib_debug_level > 0) \ ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ } while (0) #define ipoib_dbg_mcast(priv, format, arg...) 
\ --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-02 20:26:12.514771621 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-02 20:26:13.208621010 -0800 @@ -40,7 +40,7 @@ #include "ipoib.h" #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA -int data_debug_level; +static int data_debug_level; module_param(data_debug_level, int, 0644); MODULE_PARM_DESC(data_debug_level, --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:02.744892334 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:13.207621227 -0800 @@ -51,9 +51,9 @@ MODULE_LICENSE("Dual BSD/GPL"); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -int debug_level; +int ipoib_debug_level; -module_param(debug_level, int, 0644); +module_param_named(debug_level, ipoib_debug_level, int, 0644); MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #endif From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][9/11] IB/ipoib: small fixes In-Reply-To: <2005322131.OKEJHXn13XfMX2Aa@topspin.com> Message-ID: <2005322131.kDy0lnKe0rjDV0tv@topspin.com> From: Shirley Ma IPoIB small fixes: Initialize path->ah to NULL, and fix dereference after free of neigh in error path of neigh_add_path(). 
Signed-off-by: Shirley Ma Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:13.207621227 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:13.653524436 -0800 @@ -346,8 +346,9 @@ if (!path) return NULL; - path->dev = dev; + path->dev = dev; path->pathrec.dlid = 0; + path->ah = NULL; skb_queue_head_init(&path->queue); @@ -450,8 +451,8 @@ err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - kfree(neigh); neigh->neighbour->ops->destructor = NULL; + kfree(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][11/11] IB/ipoib: fix locking on path deletion In-Reply-To: <2005322131.HYDDjSPPN3QdHwmF@topspin.com> Message-ID: <2005322131.6juV8g9K5T9OJ7gu@topspin.com> Fix up locking for IPoIB path table. Make sure that destruction of address handles, neighbour info and path structs is locked properly to avoid races and deadlocks. 
(Problem originally diagnosed by Shirley Ma) Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:13.977454122 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:14.301383808 -0800 @@ -215,16 +215,25 @@ return 0; } -static void __path_free(struct net_device *dev, struct ipoib_path *path) +static void path_free(struct net_device *dev, struct ipoib_path *path) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tn; struct sk_buff *skb; + unsigned long flags; while ((skb = __skb_dequeue(&path->queue))) dev_kfree_skb_irq(skb); + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + /* + * It's safe to call ipoib_put_ah() inside priv->lock + * here, because we know that path->ah will always + * hold one more reference, so ipoib_put_ah() will + * never do more than decrement the ref count. + */ if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; @@ -232,11 +241,11 @@ kfree(neigh); } + spin_unlock_irqrestore(&priv->lock, flags); + if (path->ah) ipoib_put_ah(path->ah); - rb_erase(&path->rb_node, &priv->path_tree); - list_del(&path->list); kfree(path); } @@ -248,15 +257,20 @@ unsigned long flags; spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); INIT_LIST_HEAD(&priv->path_list); + + list_for_each_entry(path, &remove_list, list) + rb_erase(&path->rb_node, &priv->path_tree); + spin_unlock_irqrestore(&priv->lock, flags); list_for_each_entry_safe(path, tp, &remove_list, list) { if (path->query) ib_sa_cancel_query(path->query_id, path->query); wait_for_completion(&path->done); - __path_free(dev, path); + path_free(dev, path); } } @@ -361,8 +375,6 @@ path->pathrec.pkey = cpu_to_be16(priv->pkey); path->pathrec.numb_path = 1; - __path_add(dev, path); - return path; } @@ -422,6 +434,8 @@ (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) goto 
err; + + __path_add(dev, path); } list_add_tail(&neigh->list, &path->neigh_list); @@ -497,8 +511,12 @@ skb_push(skb, sizeof *phdr); __skb_queue_tail(&path->queue, skb); - if (path_rec_start(dev, path)) - __path_free(dev, path); + if (path_rec_start(dev, path)) { + spin_unlock(&priv->lock); + path_free(dev, path); + return; + } else + __path_add(dev, path); } else { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -658,7 +676,7 @@ static void ipoib_neigh_destructor(struct neighbour *n) { - struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_neigh *neigh; struct ipoib_dev_priv *priv = netdev_priv(n->dev); unsigned long flags; struct ipoib_ah *ah = NULL; @@ -670,6 +688,7 @@ spin_lock_irqsave(&priv->lock, flags); + neigh = *to_ipoib_neigh(n); if (neigh) { if (neigh->ah) ah = neigh->ah; From roland at topspin.com Wed Mar 2 21:31:22 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 2 Mar 2005 21:31:22 -0800 Subject: [openib-general] [PATCH][10/11] IB/ipoib: don't call ipoib_put_ah with lock held In-Reply-To: <2005322131.kDy0lnKe0rjDV0tv@topspin.com> Message-ID: <2005322131.HYDDjSPPN3QdHwmF@topspin.com> From: Shirley Ma ipoib_put_ah() may call ipoib_free_ah(), which might take the device's lock. Therefore we need to make sure we don't call ipoib_put_ah() when holding the lock already. 
Signed-off-by: Shirley Ma Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:13.653524436 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-02 20:26:13.977454122 -0800 @@ -661,6 +661,7 @@ struct ipoib_neigh *neigh = *to_ipoib_neigh(n); struct ipoib_dev_priv *priv = netdev_priv(n->dev); unsigned long flags; + struct ipoib_ah *ah = NULL; ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", @@ -671,13 +672,16 @@ if (neigh) { if (neigh->ah) - ipoib_put_ah(neigh->ah); + ah = neigh->ah; list_del(&neigh->list); *to_ipoib_neigh(n) = NULL; kfree(neigh); } spin_unlock_irqrestore(&priv->lock, flags); + + if (ah) + ipoib_put_ah(ah); } static int ipoib_neigh_setup(struct neighbour *neigh) --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-03-02 20:26:12.799709771 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-03-02 20:26:13.977454122 -0800 @@ -93,6 +93,8 @@ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; + LIST_HEAD(ah_list); + struct ipoib_ah *ah, *tah; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -101,7 +103,8 @@ spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { - ipoib_put_ah(neigh->ah); + if (neigh->ah) + list_add_tail(&neigh->ah->list, &ah_list); *to_ipoib_neigh(neigh->neighbour) = NULL; neigh->neighbour->ops->destructor = NULL; kfree(neigh); @@ -109,6 +112,9 @@ spin_unlock_irqrestore(&priv->lock, flags); + list_for_each_entry_safe(ah, tah, &ah_list, list) + ipoib_put_ah(ah); + if (mcast->ah) ipoib_put_ah(mcast->ah); From halr at voltaire.com Wed Mar 2 22:07:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Mar 2005 01:07:31 -0500 Subject: [openib-general] "arping" failing over ipoib In-Reply-To: <521xaxs5p0.fsf@topspin.com> References: <1109806789.4913.43.camel@duffman> 
 <521xaxs5p0.fsf@topspin.com>
Message-ID: <1109829887.4645.132.camel@localhost.localdomain>

On Wed, 2005-03-02 at 23:40, Roland Dreier wrote:
> What packet does arping create?

I think it creates an ARP first. It does everything "directly".

> Unfortunately, because a "normal" IPoIB packet doesn't include
> any encapsulation beyond the 4 bytes of ethertype/reserved,
> it's a little difficult for userspace to send broadcast packets.

For ARP specifically (ethertype 0x0806), there is no way without
additional information for the IPoIB driver to know whether it
is intended to be broadcast or unicast.

I'm also not sure how responses would make it back to user space either
as I would expect arping to be looking for the response or lack thereof.

> If arping tries to create an ethernet-like header,
> then the IPoIB driver is going to get a little confused.

That's what is currently going on and is why the group gets confused.
The driver is likely looking into this different packet incorrectly.
It does recognize that it is not a unicast packet.

-- Hal

From shaharf at voltaire.com Thu Mar 3 02:45:03 2005
From: shaharf at voltaire.com (shaharf)
Date: Thu, 3 Mar 2005 12:45:03 +0200
Subject: [openib-general] IB Address Translation service
Message-ID:

>
> The advantage of ATS is that it "just works" whether wired point to
> point, or via a switch, or whatever. It requires no central
> administration, works as transparently as ARP and ND, and supports
> IP addressing so applications don't have any ambiguity in how they
> resolve names.
>
> If we get rid of ATS, what do we replace it with? Raw IB GID's from
> the application??
>
> Tom.
>

Tom, I am with you regarding that subject.
Even though both IB-ARP and ATS should be considered hacks, I think
that IB-ARP is much more problematic and does not deliver a complete
solution: it doesn't contain all the data required, it does not solve the
SM load problem (due to the requirement for the path record query), and it
is a mechanism that contradicts the IB management architecture.

I would fix ATS to include some missing fields, and maybe define a
unified ATS + path query for performance.

The SM/SA scalability problem should be solved by distributing the SA
part of it, probably using a single-writer/multiple-reader model and a
simple cache coherency protocol to allow efficient caching by sub SA
agents or even hosts. This type of distribution is also clearly requested
in the SOW section 1.3.1.

Shahar

From Thomas.Talpey at netapp.com Thu Mar 3 06:46:53 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Thu, 03 Mar 2005 09:46:53 -0500
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>
Message-ID: <6.2.1.2.2.20050303094159.050933a0@exnane01.nane.netapp.com>

At 02:22 PM 3/2/2005, Woodruff, Robert J wrote:
>I think the point is that only one of those interconnects (IB) is
>in the kernel, the rest are proprietary. Do any of the other RDMA
>interconnect vendors plan to submit their code for inclusion into Linux
>in the near future ?

Yes - take a look at where you can freely
download their complete stack, including drivers for their iWARP NIC,
plus MPI and DAPL API libraries. It runs on 2.6.10 and many versions
back (including 2.4.x).

I'll let Clem speak for his plans to submit it, however. I'm just a
satisfied user.

Tom.
From Thomas.Talpey at netapp.com Thu Mar 3 06:56:49 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Thu, 03 Mar 2005 09:56:49 -0500
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
In-Reply-To: <20050303034827.GA9092@lst.de>
References: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> <20050303034827.GA9092@lst.de>
Message-ID: <6.2.1.2.2.20050303094748.03422620@exnane01.nane.netapp.com>

At 10:48 PM 3/2/2005, Christoph Hellwig wrote:
>The current iSER code is 10928 LOC, add to that 22155 LOC of kDAPL (not
>including the actual provider for IB) and 5822 LOC linux-iscsi kernel
>code. Compare that to the 25412 LOC total for drivers/infiniband in Linux
>2.6.11.

Is this just about LOCs? I think you should wait to see how large kDAPL is *after* it has been properly integrated into the kernel before judging that. At present, the code is heavily commented and fully generalized to aid porting to multiple operating systems. It will look quite different once it is freed of these attributes. Also, I'll point out there is extensive debug and trace code throughout, which is optional.

BTW I agree with Yaron that one copy of code in DAPL replaces N copies of the same code in all RDMA drivers (IB, iWARP etc), or worse, upper layers. Which is why NFS/RDMA needs it.

Tom.

From clemc at ammasso.com Thu Mar 3 07:24:35 2005
From: clemc at ammasso.com (Clem Cole)
Date: Thu, 3 Mar 2005 10:24:35 -0500
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
Message-ID: <8E9D028761D8264D910612167E8457E8742C24@mail2>

Thanks for the kind words Tom.

Indeed, Ammasso is fully open source and is thrilled at the idea of getting DAPL into the mainline. We went GA with our 1.2 release recently and have had our AMSO100 hardware and associated iWARP software in the hands of about 60 different sites - both HPTC and commercial. Feel free to download the code and have a look.
As Tom says, it has been tested on kernels as far back as those for RH 7.3 and as modern as kernel.org's 2.6.10. We have tested on both x86 and x86-64.

As Tom says, we too (as well as our customers) are satisfied users of the DAPL interface. DAPL certainly continues to show that it works pretty well for us and is easier to use than the low-level QP verbs layer. Recently an ISV moved a hunk of code from another interface (we do not know which) to our DAPL implementation in about 2.5 weeks.

For the record, our kDAPL and uDAPL are derived from the DAT reference code. Besides writing the Ammasso-specific provider code, since IB != iWARP we did have to make some small changes to common code to handle iWARP-specific differences. We are working with Tom, Arkady and the rest of the DAPL community to get the iWARP changes back into the base, to help ensure that the DAPL interface is not considered just an IB thing and that people do not come to the incorrect conclusion that IB verbs are ``good enough.''

For whatever it's worth, we are actively working on being not only DAPL/iWARP >>compliant<< (working with UNH etc) but also >>compatible<< - going to plugfests and working with actual ISVs that have written Linux code that relies on DAPL/iWARP -- i.e. not only do we pass the full uDAPL/kDAPL test suite, we have been working with a number of different large commercial vendors (whose source code we have never seen) to get their tests, as well as their >>applications<< which were designed to run over DAPL, running. Since these codes had previously been tested only on IB (i.e. we are the first shipping iWARP provider), we feel pretty good that our DAPL works as expected.

Note: our plan is to increase the number of applications that use the code as quickly as possible; but we are a small startup and are bandwidth-limited by the number of ISVs we can work with directly at one time.

If you have specific questions, feel free to take them off line to me.

Clem Cole Dist.
Eng

PS If you are interested in getting your hands on hardware, drop me a line and I'll put you in touch with our sales guys.

-----Original Message-----
From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com]
Sent: Thursday, March 03, 2005 9:47 AM
To: openib-general at openib.org
Cc: Clem Cole
Subject: RE: [openib-general] putting in dead wood for DAPL and similarabomination

At 02:22 PM 3/2/2005, Woodruff, Robert J wrote:
>I think the point is that only one of those interconnects (IB) is
>in the kernel, the rest are proprietary. Do any of the other RDMA
>interconnect vendors plan to submit their code for inclusion into Linux
>in the near future ?

Yes - take a look at where you can freely download their complete stack, including drivers for their iWARP NIC, plus MPI and DAPL API libraries. It runs on 2.6.10 and many versions back (including 2.4.x).

I'll let Clem speak for his plans to submit it, however. I'm just a satisfied user.

Tom.

From yaronh at voltaire.com Thu Mar 3 07:23:00 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Thu, 3 Mar 2005 17:23:00 +0200
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAE4E@taurus.voltaire.com>

> -----Original Message-----
> From: Christoph Hellwig [mailto:hch at lst.de]
> Sent: Thursday, March 03, 2005 5:48 AM
> To: Yaron Haviv
> Cc: Christoph Hellwig; James Lentini; openib-general at openib.org
> Subject: Re: [openib-general] putting in dead wood for DAPL and
> similarabomination
>
> The current iSER code is 10928 LOC, add to that 22155 LOC of kDAPL (not
> including the actual provider for IB) and 5822 LOC linux-iscsi kernel
> code. Compare that to the 25412 LOC total for drivers/infiniband in Linux
> 2.6.11.

As Tom indicated, we expect a significant code shrink for kDAPL. It will be much more Linux friendly when we are done with it; some parts will be re-written.
Also, the iSER code is not optimal in terms of LOC, and we can clean up some redundant code if we are in an LOC contest. I believe that after we glue all the layers together we will focus on reducing LOC and test code.

> Here's the challenge: if someone gets me the funding I'll write
> complete iSER of IB implementation in less than 10k LOC based on the
> open-iscsi code if someone gets me the funding.

You know there is also the challenge of making it work, perform, interoperate, and support some features; not everything is about LOC :)

Anyway, thanks for offering us support; we may take you up on that some day.

Yaron

From roland at topspin.com Thu Mar 3 12:43:48 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 12:43:48 -0800
Subject: [openib-general] mr_table.max_mtt_order
In-Reply-To: <526509s5ty.fsf@topspin.com> (Roland Dreier's message of "Wed, 02 Mar 2005 20:37:13 -0800")
References: <20050302104311.GD25029@mellanox.co.il> <526509s5ty.fsf@topspin.com>
Message-ID: <52hdjspiij.fsf@topspin.com>

    Roland> I thought about this a little. I think that any inline
    Roland> function forces someone reading the code to look up what
    Roland> it does, no matter how descriptive the name we come up
    Roland> with. I think it would be better to use fls() directly.
    Roland> I'm already in the habit of using ffs() to compute log_2
    Roland> of powers of two but for some reason I never remember
    Roland> fls().

Actually, since there's already roundup_pow_of_two() in kernel.h, let's use that.

 - R.

From mst at mellanox.co.il Thu Mar 3 13:05:25 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 Mar 2005 23:05:25 +0200
Subject: [openib-general] mr_table.max_mtt_order
In-Reply-To: <52hdjspiij.fsf@topspin.com>
References: <20050302104311.GD25029@mellanox.co.il> <526509s5ty.fsf@topspin.com> <52hdjspiij.fsf@topspin.com>
Message-ID: <20050303210525.GA4022@mellanox.co.il>

Quoting r.
Roland Dreier : > Subject: Re: [openib-general] mr_table.max_mtt_order > > Roland> I thought about this a little. I think that any inline > Roland> function forces someone reading the code to look up what > Roland> it does, no matter how descriptive the name we come up > Roland> with. I think it would be better to use fls() directly. > Roland> I'm already in the habit of using ffs() to compute log_2 > Roland> of powers of two but for some reason I never remember > Roland> fls(). > > Actually, since there's already roundup_pow_of_two() in kernel.h, > let's use that. > > - R. > I plan to post a patch on Sunday. -- MST - Michael S. Tsirkin From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][0/26] InfiniBand merge Message-ID: <2005331520.b7ycIGGfSwBBRSED@topspin.com> Here's another series of patches that applies on top of the fixes I posted yesterday. This series syncs the kernel with everything ready for merging from the OpenIB subversion tree. Most of these patches add more support for "mem-free" mode to mthca. This allows PCI Express HCAs to operate by storing context in the host system's memory rather than in dedicated memory attached to the HCA. With this series of patches, mem-free mode is usable -- in fact, this series of patches is being posted from a system whose only network connection is IP-over-IB running on a mem-free HCA. Thanks, Roland From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][1/26] IB: fix ib_find_cached_gid() port numbering In-Reply-To: <2005331520.b7ycIGGfSwBBRSED@topspin.com> Message-ID: <2005331520.OFd0tTycEIjc5XlW@topspin.com> From: Sean Hefty Fix ib_find_cached_gid() to return the correct port number relative to the port numbering used by the device. 
Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/cache.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/core/cache.c 2005-03-03 15:02:57.180310444 -0800 @@ -114,7 +114,7 @@ cache = device->cache.gid_cache[p]; for (i = 0; i < cache->table_len; ++i) { if (!memcmp(gid, &cache->table[i], sizeof *gid)) { - *port_num = p; + *port_num = p + start_port(device); if (index) *index = i; ret = 0; From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][2/26] IB/mthca: CQ minor tweaks In-Reply-To: <2005331520.OFd0tTycEIjc5XlW@topspin.com> Message-ID: <2005331520.GI2ijwUAkM9zyNyy@topspin.com> From: "Michael S. Tsirkin" Clean up CQ code so that we only calculate the address of a CQ entry once when using it. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-02-03 16:59:43.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:51.832670421 -0800 @@ -147,20 +147,21 @@ + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; } -static inline int cqe_sw(struct mthca_cq *cq, int i) +static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) { - return !(MTHCA_CQ_ENTRY_OWNER_HW & - get_cqe(cq, i)->owner); + struct mthca_cqe *cqe; + cqe = get_cqe(cq, i); + return (MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner) ? 
NULL : cqe; } -static inline int next_cqe_sw(struct mthca_cq *cq) +static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq) { return cqe_sw(cq, cq->cons_index); } -static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +static inline void set_cqe_hw(struct mthca_cqe *cqe) { - get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; + cqe->owner = MTHCA_CQ_ENTRY_OWNER_HW; } static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, @@ -388,7 +389,8 @@ int free_cqe = 1; int err = 0; - if (!next_cqe_sw(cq)) + cqe = next_cqe_sw(cq); + if (!cqe) return -EAGAIN; /* @@ -397,8 +399,6 @@ */ rmb(); - cqe = get_cqe(cq, cq->cons_index); - if (0) { mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), @@ -509,8 +509,8 @@ entry->status = IB_WC_SUCCESS; out: - if (free_cqe) { - set_cqe_hw(cq, cq->cons_index); + if (likely(free_cqe)) { + set_cqe_hw(cqe); ++(*freed); cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; } @@ -655,7 +655,7 @@ } for (i = 0; i < nent; ++i) - set_cqe_hw(cq, i); + set_cqe_hw(get_cqe(cq, i)); cq->cqn = mthca_alloc(&dev->cq_table.alloc); if (cq->cqn == -1) @@ -773,7 +773,7 @@ int j; printk(KERN_ERR "context for CQN %x (cons index %x, next sw %d)\n", - cq->cqn, cq->cons_index, next_cqe_sw(cq)); + cq->cqn, cq->cons_index, !!next_cqe_sw(cq)); for (j = 0; j < 16; ++j) printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); }

From roland at topspin.com Thu Mar 3 15:20:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:26 -0800
Subject: [openib-general] [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <2005331520.GI2ijwUAkM9zyNyy@topspin.com>
Message-ID: <2005331520.cHJfJcRbBu1fFgB6@topspin.com>

From: Michael S. Tsirkin

Avoid taking the CQ table lock in the fast path by using synchronize_irq() after removing a CQ from the table to make sure that no completion events are still in progress.
This gets a nice speedup (about 4%) in IP over IB on my hardware. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:51.832670421 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.368554099 -0800 @@ -33,6 +33,7 @@ */ #include +#include #include @@ -181,11 +182,7 @@ { struct mthca_cq *cq; - spin_lock(&dev->cq_table.lock); cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); - if (cq) - atomic_inc(&cq->refcount); - spin_unlock(&dev->cq_table.lock); if (!cq) { mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); @@ -193,9 +190,6 @@ } cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); - - if (atomic_dec_and_test(&cq->refcount)) - wake_up(&cq->wait); } void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) @@ -783,6 +777,11 @@ cq->cqn & (dev->limits.num_cqs - 1)); spin_unlock_irq(&dev->cq_table.lock); + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) + synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector); + else + synchronize_irq(dev->pdev->irq); + atomic_dec(&cq->refcount); wait_event(cq->wait, !atomic_read(&cq->refcount)); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][4/26] IB/mthca: improve CQ locking part 2 In-Reply-To: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> Message-ID: <2005331520.lXKA9W9JoVIrmqB8@topspin.com> From: Michael S. Tsirkin Locking during the poll cq operation can be reduced by locking the cq while qp is being removed from the qp array. This also avoids an extra atomic operation for reference counting. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.368554099 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.923433653 -0800 @@ -418,14 +418,14 @@ spin_unlock(&(*cur_qp)->lock); } - spin_lock(&dev->qp_table.lock); + /* + * We do not have to take the QP table lock here, + * because CQs will be locked while QPs are removed + * from the table. + */ *cur_qp = mthca_array_get(&dev->qp_table.qp, be32_to_cpu(cqe->my_qpn) & (dev->limits.num_qps - 1)); - if (*cur_qp) - atomic_inc(&(*cur_qp)->refcount); - spin_unlock(&dev->qp_table.lock); - if (!*cur_qp) { mthca_warn(dev, "CQ entry for unknown QP %06x\n", be32_to_cpu(cqe->my_qpn) & 0xffffff); @@ -537,12 +537,8 @@ inc_cons_index(dev, cq, freed); } - if (qp) { + if (qp) spin_unlock(&qp->lock); - if (atomic_dec_and_test(&qp->refcount)) - wake_up(&qp->wait); - } - spin_unlock_irqrestore(&cq->lock, flags); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-02-03 16:59:28.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:52.924433436 -0800 @@ -1083,9 +1083,21 @@ return 0; err_out_free: - spin_lock_irq(&dev->qp_table.lock); + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. 
+ */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, mqpn); - spin_unlock_irq(&dev->qp_table.lock); + spin_unlock(&dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); err_out: dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, @@ -1100,11 +1112,28 @@ u8 status; int size; int i; + struct mthca_cq *send_cq; + struct mthca_cq *recv_cq; + + send_cq = to_mcq(qp->ibqp.send_cq); + recv_cq = to_mcq(qp->ibqp.recv_cq); - spin_lock_irq(&dev->qp_table.lock); + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, qp->qpn & (dev->limits.num_qps - 1)); - spin_unlock_irq(&dev->qp_table.lock); + spin_unlock(&dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); atomic_dec(&qp->refcount); wait_event(qp->wait, !atomic_read(&qp->refcount)); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][5/26] IB/mthca: CQ cleanups In-Reply-To: <2005331520.lXKA9W9JoVIrmqB8@topspin.com> Message-ID: <2005331520.bkPiyqSCQe0LOju5@topspin.com> Simplify some of the code for CQ handling slightly. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.923433653 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:53.538300187 -0800 @@ -150,9 +150,8 @@ static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) { - struct mthca_cqe *cqe; - cqe = get_cqe(cq, i); - return (MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner) ? 
NULL : cqe; + struct mthca_cqe *cqe = get_cqe(cq, i); + return MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner ? NULL : cqe; } static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq) @@ -378,7 +377,7 @@ struct mthca_wq *wq; struct mthca_cqe *cqe; int wqe_index; - int is_error = 0; + int is_error; int is_send; int free_cqe = 1; int err = 0; @@ -401,12 +400,9 @@ dump_cqe(cqe); } - if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == - MTHCA_ERROR_CQE_OPCODE_MASK) { - is_error = 1; - is_send = cqe->opcode & 1; - } else - is_send = cqe->is_send & 0x80; + is_error = (cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK; + is_send = is_error ? cqe->opcode & 0x01 : cqe->is_send & 0x80; if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { if (*cur_qp) { From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][6/26] IB: remove unsignaled receives In-Reply-To: <2005331520.bkPiyqSCQe0LOju5@topspin.com> Message-ID: <2005331520.psAuTRchMaqO6dem@topspin.com> From: Michael S. Tsirkin Remove support for unsignaled receive requests. This is a non-standard extension to the IB spec that is not used by any known applications or protocols, and is not supported by newer hardware. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-03-03 14:12:54.671054304 -0800 @@ -2191,7 +2191,6 @@ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; - recv_wr.recv_flags = IB_RECV_SIGNALED; do { /* Allocate and map receive buffer */ @@ -2386,7 +2385,6 @@ qp_init_attr.send_cq = qp_info->port_priv->cq; qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-01-25 20:48:48.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:54.672054087 -0800 @@ -369,14 +369,12 @@ struct mthca_cq *recv_cq, enum ib_qp_type type, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, struct mthca_qp *qp); int mthca_alloc_sqp(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, struct mthca_cq *recv_cq, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, int qpn, int port, struct mthca_sqp *sqp); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-01-25 20:49:23.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:54.673053870 -0800 @@ -343,7 +343,7 @@ to_mcq(init_attr->send_cq), to_mcq(init_attr->recv_cq), init_attr->qp_type, init_attr->sq_sig_type, - init_attr->rq_sig_type, qp); + qp); qp->ibqp.qp_num = qp->qpn; break; } @@ -364,7 +364,7 @@ err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), to_mcq(init_attr->send_cq), to_mcq(init_attr->recv_cq), - init_attr->sq_sig_type, init_attr->rq_sig_type, + init_attr->sq_sig_type, qp->ibqp.qp_num, init_attr->port_num, to_msqp(qp)); break; --- 
linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-01-25 20:47:46.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:54.674053653 -0800 @@ -154,7 +154,6 @@ void *last; int max_gs; int wqe_shift; - enum ib_sig_type policy; }; struct mthca_qp { @@ -172,6 +171,7 @@ struct mthca_wq rq; struct mthca_wq sq; + enum ib_sig_type sq_policy; int send_wqe_offset; u64 *wrid; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:52.924433436 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:54.675053436 -0800 @@ -690,7 +690,7 @@ MTHCA_QP_BIT_SRE | MTHCA_QP_BIT_SWE | MTHCA_QP_BIT_SAE); - if (qp->sq.policy == IB_SIGNAL_ALL_WR) + if (qp->sq_policy == IB_SIGNAL_ALL_WR) qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); if (attr_mask & IB_QP_RETRY_CNT) { qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); @@ -778,8 +778,8 @@ qp->resp_depth = attr->max_rd_atomic; } - if (qp->rq.policy == IB_SIGNAL_ALL_WR) - qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); @@ -977,7 +977,6 @@ struct mthca_cq *send_cq, struct mthca_cq *recv_cq, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, struct mthca_qp *qp) { int err; @@ -987,8 +986,7 @@ qp->state = IB_QPS_RESET; qp->atomic_rd_en = 0; qp->resp_depth = 0; - qp->sq.policy = send_policy; - qp->rq.policy = recv_policy; + qp->sq_policy = send_policy; qp->rq.cur = 0; qp->sq.cur = 0; qp->rq.next = 0; @@ -1008,7 +1006,6 @@ struct mthca_cq *recv_cq, enum ib_qp_type type, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, struct mthca_qp *qp) { int err; @@ -1025,7 +1022,7 @@ return -ENOMEM; err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, - send_policy, recv_policy, 
qp); + send_policy, qp); if (err) { mthca_free(&dev->qp_table.alloc, qp->qpn); return err; @@ -1044,7 +1041,6 @@ struct mthca_cq *send_cq, struct mthca_cq *recv_cq, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, int qpn, int port, struct mthca_sqp *sqp) @@ -1073,8 +1069,7 @@ sqp->qp.transport = MLX; err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, - send_policy, recv_policy, - &sqp->qp); + send_policy, &sqp->qp); if (err) goto err_out_free; @@ -1495,9 +1490,7 @@ ((struct mthca_next_seg *) wqe)->nda_op = 0; ((struct mthca_next_seg *) wqe)->ee_nds = cpu_to_be32(MTHCA_NEXT_DBD); - ((struct mthca_next_seg *) wqe)->flags = - (wr->recv_flags & IB_RECV_SIGNALED) ? - cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + ((struct mthca_next_seg *) wqe)->flags = 0; wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; --- linux-export.orig/drivers/infiniband/include/ib_verbs.h 2005-01-25 20:47:00.000000000 -0800 +++ linux-export/drivers/infiniband/include/ib_verbs.h 2005-03-03 14:12:54.669054738 -0800 @@ -73,7 +73,6 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), - IB_DEVICE_RQ_SIG_TYPE = (1<<15) }; enum ib_atomic_cap { @@ -408,7 +407,6 @@ struct ib_srq *srq; struct ib_qp_cap cap; enum ib_sig_type sq_sig_type; - enum ib_sig_type rq_sig_type; enum ib_qp_type qp_type; u8 port_num; /* special QP types only */ }; @@ -533,10 +531,6 @@ IB_SEND_INLINE = (1<<3) }; -enum ib_recv_flags { - IB_RECV_SIGNALED = 1 -}; - struct ib_sge { u64 addr; u32 length; @@ -579,7 +573,6 @@ u64 wr_id; struct ib_sge *sg_list; int num_sge; - int recv_flags; }; enum ib_access_flags { --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-03 14:12:54.668054955 -0800 @@ -105,7 +105,6 @@ .wr_id = wr_id | IPOIB_OP_RECV, .sg_list = &list, .num_sge = 1, - .recv_flags = IB_RECV_SIGNALED }; struct ib_recv_wr 
*bad_wr; --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2005-01-15 15:19:59.000000000 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2005-03-03 14:12:54.667055172 -0800 @@ -165,7 +165,6 @@ .max_recv_sge = 1 }, .sq_sig_type = IB_SIGNAL_ALL_WR, - .rq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][7/26] IB/mthca: map registers for mem-free mode In-Reply-To: <2005331520.psAuTRchMaqO6dem@topspin.com> Message-ID: <2005331520.q2c4004P8DuwgJEx@topspin.com> Move the request/ioremap of regions related to event handling into mthca_eq.c. Map the correct regions depending on whether we're in Tavor or native mem-free mode. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_config_reg.h 2005-01-25 20:48:48.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_config_reg.h 2005-03-03 14:12:55.516870705 -0800 @@ -46,5 +46,6 @@ #define MTHCA_MAP_ECR_SIZE (MTHCA_ECR_SIZE + MTHCA_ECR_CLR_SIZE) #define MTHCA_CLR_INT_BASE 0xf00d8 #define MTHCA_CLR_INT_SIZE 0x00008 +#define MTHCA_EQ_SET_CI_SIZE (8 * 32) #endif /* MTHCA_CONFIG_REG_H */ --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:54.672054087 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:55.515870922 -0800 @@ -237,9 +237,17 @@ struct semaphore cap_mask_mutex; void __iomem *hcr; - void __iomem *ecr_base; - void __iomem *clr_base; void __iomem *kar; + void __iomem *clr_base; + union { + struct { + void __iomem *ecr_base; + } tavor; + struct { + void __iomem *eq_arm; + void __iomem *eq_set_ci_base; + } arbel; + } eq_regs; struct mthca_cmd cmd; struct mthca_limits limits; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-01-25 20:48:48.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 
2005-03-03 14:12:55.516870705 -0800 @@ -366,10 +366,10 @@ if (dev->eq_table.clr_mask) writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); - if ((ecr = readl(dev->ecr_base + 4)) != 0) { + if ((ecr = readl(dev->eq_regs.tavor.ecr_base + 4)) != 0) { work = 1; - writel(ecr, dev->ecr_base + + writel(ecr, dev->eq_regs.tavor.ecr_base + MTHCA_ECR_CLR_BASE - MTHCA_ECR_BASE + 4); for (i = 0; i < MTHCA_NUM_EQ; ++i) @@ -578,6 +578,129 @@ dev->eq_table.eq + i); } +static int __devinit mthca_map_reg(struct mthca_dev *dev, + unsigned long offset, unsigned long size, + void __iomem **map) +{ + unsigned long base = pci_resource_start(dev->pdev, 0); + + if (!request_mem_region(base + offset, size, DRV_NAME)) + return -EBUSY; + + *map = ioremap(base + offset, size); + if (!*map) { + release_mem_region(base + offset, size); + return -ENOMEM; + } + + return 0; +} + +static void mthca_unmap_reg(struct mthca_dev *dev, unsigned long offset, + unsigned long size, void __iomem *map) +{ + unsigned long base = pci_resource_start(dev->pdev, 0); + + release_mem_region(base + offset, size); + iounmap(map); +} + +static int __devinit mthca_map_eq_regs(struct mthca_dev *dev) +{ + unsigned long mthca_base; + + mthca_base = pci_resource_start(dev->pdev, 0); + + if (dev->hca_type == ARBEL_NATIVE) { + /* + * We assume that the EQ arm and EQ set CI registers + * fall within the first BAR. We can't trust the + * values firmware gives us, since those addresses are + * valid on the HCA's side of the PCI bus but not + * necessarily the host side. + */ + if (mthca_map_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + &dev->clr_base)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + return -ENOMEM; + } + + /* + * Add 4 because we limit ourselves to EQs 0 ... 31, + * so we only need the low word of the register. 
+ */ + if (mthca_map_reg(dev, ((pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_arm_base) + 4, 4, + &dev->eq_regs.arbel.eq_arm)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + dev->clr_base); + return -ENOMEM; + } + + if (mthca_map_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_set_ci_base, + MTHCA_EQ_SET_CI_SIZE, + &dev->eq_regs.arbel.eq_set_ci_base)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + mthca_unmap_reg(dev, ((pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_arm_base) + 4, 4, + dev->eq_regs.arbel.eq_arm); + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + dev->clr_base); + return -ENOMEM; + } + } else { + if (mthca_map_reg(dev, MTHCA_CLR_INT_BASE, MTHCA_CLR_INT_SIZE, + &dev->clr_base)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + return -ENOMEM; + } + + if (mthca_map_reg(dev, MTHCA_ECR_BASE, + MTHCA_ECR_SIZE + MTHCA_ECR_CLR_SIZE, + &dev->eq_regs.tavor.ecr_base)) { + mthca_err(dev, "Couldn't map ecr register, " + "aborting.\n"); + mthca_unmap_reg(dev, MTHCA_CLR_INT_BASE, MTHCA_CLR_INT_SIZE, + dev->clr_base); + return -ENOMEM; + } + } + + return 0; + +} + +static void __devexit mthca_unmap_eq_regs(struct mthca_dev *dev) +{ + if (dev->hca_type == ARBEL_NATIVE) { + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_set_ci_base, + MTHCA_EQ_SET_CI_SIZE, + dev->eq_regs.arbel.eq_set_ci_base); + mthca_unmap_reg(dev, ((pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_arm_base) + 4, 4, + dev->eq_regs.arbel.eq_arm); + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + dev->clr_base); + } else { + mthca_unmap_reg(dev, MTHCA_ECR_BASE, + MTHCA_ECR_SIZE + 
MTHCA_ECR_CLR_SIZE, + dev->eq_regs.tavor.ecr_base); + mthca_unmap_reg(dev, MTHCA_CLR_INT_BASE, MTHCA_CLR_INT_SIZE, + dev->clr_base); + } +} + int __devinit mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt) { int ret; @@ -636,6 +759,10 @@ if (err) return err; + err = mthca_map_eq_regs(dev); + if (err) + goto err_out_free; + if (dev->mthca_flags & MTHCA_FLAG_MSI || dev->mthca_flags & MTHCA_FLAG_MSI_X) { dev->eq_table.clr_mask = 0; @@ -653,7 +780,7 @@ (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, &dev->eq_table.eq[MTHCA_EQ_COMP]); if (err) - goto err_out_free; + goto err_out_unmap; err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, @@ -720,6 +847,9 @@ err_out_comp: mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); +err_out_unmap: + mthca_unmap_eq_regs(dev); + err_out_free: mthca_alloc_cleanup(&dev->eq_table.alloc); return err; @@ -740,5 +870,7 @@ for (i = 0; i < MTHCA_NUM_EQ; ++i) mthca_free_eq(dev, &dev->eq_table.eq[i]); + mthca_unmap_eq_regs(dev); + mthca_alloc_cleanup(&dev->eq_table.alloc); } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-01-25 20:49:05.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:55.516870705 -0800 @@ -686,37 +686,18 @@ int err; /* - * We request our first BAR in two chunks, since the MSI-X - * vector table is right in the middle. + * We can't just use pci_request_regions() because the MSI-X + * table is right in the middle of the first BAR. If we did + * pci_request_region and grab all of the first BAR, then + * setting up MSI-X would fail, since the PCI core wants to do + * request_mem_region on the MSI-X vector table. * - * This is why we can't just use pci_request_regions() -- if - * we did then setting up MSI-X would fail, since the PCI core - * wants to do request_mem_region on the MSI-X vector table. + * So just request what we need right now, and request any + * other regions we need when setting up EQs. 
*/ - if (!request_mem_region(pci_resource_start(pdev, 0) + - MTHCA_HCR_BASE, - MTHCA_HCR_SIZE, - DRV_NAME)) { - err = -EBUSY; - goto err_hcr_failed; - } - - if (!request_mem_region(pci_resource_start(pdev, 0) + - MTHCA_ECR_BASE, - MTHCA_MAP_ECR_SIZE, - DRV_NAME)) { - err = -EBUSY; - goto err_ecr_failed; - } - - if (!request_mem_region(pci_resource_start(pdev, 0) + - MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE, - DRV_NAME)) { - err = -EBUSY; - goto err_int_failed; - } - + if (!request_mem_region(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, + MTHCA_HCR_SIZE, DRV_NAME)) + return -EBUSY; err = pci_request_region(pdev, 2, DRV_NAME); if (err) @@ -731,24 +712,11 @@ return 0; err_bar4_failed: - pci_release_region(pdev, 2); -err_bar2_failed: - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); -err_int_failed: - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_ECR_BASE, - MTHCA_MAP_ECR_SIZE); -err_ecr_failed: - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_HCR_BASE, +err_bar2_failed: + release_mem_region(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); -err_hcr_failed: return err; } @@ -761,16 +729,7 @@ pci_release_region(pdev, 2); - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_ECR_BASE, - MTHCA_MAP_ECR_SIZE); - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_HCR_BASE, + release_mem_region(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); } @@ -941,31 +900,13 @@ goto err_free_dev; } - mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); - if (!mdev->clr_base) { - mthca_err(mdev, "Couldn't map interrupt clear register, " - "aborting.\n"); - err = -ENOMEM; - goto err_iounmap; - } - - mdev->ecr_base = ioremap(mthca_base + MTHCA_ECR_BASE, - MTHCA_ECR_SIZE + MTHCA_ECR_CLR_SIZE); - if (!mdev->ecr_base) { - mthca_err(mdev, "Couldn't map ecr 
register, " - "aborting.\n"); - err = -ENOMEM; - goto err_iounmap_clr; - } - mthca_base = pci_resource_start(pdev, 2); mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); if (!mdev->kar) { mthca_err(mdev, "Couldn't map kernel access region, " "aborting.\n"); err = -ENOMEM; - goto err_iounmap_ecr; + goto err_iounmap; } err = mthca_tune_pci(mdev); @@ -1014,12 +955,6 @@ err_iounmap_kar: iounmap(mdev->kar); -err_iounmap_ecr: - iounmap(mdev->ecr_base); - -err_iounmap_clr: - iounmap(mdev->clr_base); - err_iounmap: iounmap(mdev->hcr); @@ -1067,9 +1002,8 @@ mthca_close_hca(mdev); + iounmap(mdev->kar); iounmap(mdev->hcr); - iounmap(mdev->ecr_base); - iounmap(mdev->clr_base); if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) pci_disable_msix(pdev); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][9/26] IB/mthca: dynamic context memory mapping for mem-free mode In-Reply-To: <2005331520.qwFp6OBqldRd6oo8@topspin.com> Message-ID: <2005331520.nfKPjEcWG6DlwOqo@topspin.com> Add support for mapping more memory into HCA's context to cover context tables when new objects are allocated. Pass the object size into mthca_alloc_icm_table(), reference count the ICM chunks, and add new mthca_table_get() and mthca_table_put() functions to handle mapping memory when allocating or destroying objects. 
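As a rough illustration of the get/put scheme described above, here is a minimal user-space sketch. It is not the driver code: the sketch_* names and the 4096-byte chunk size are assumptions made for the example, malloc/calloc stand in for mthca_alloc_icm() plus the MAP_ICM firmware command, and the per-table mutex is left out. It only shows how an object number selects a chunk and how the reference count decides when a chunk is mapped or released.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative chunk size; stands in for MTHCA_TABLE_CHUNK_SIZE. */
#define SKETCH_CHUNK_SIZE 4096

/* Hypothetical user-space stand-ins for the driver's ICM structures. */
struct sketch_chunk {
	int refcount;
	void *mem;
};

struct sketch_table {
	int num_obj;    /* assumed to be a power of two */
	int obj_size;   /* bytes per object */
	int num_chunks;
	struct sketch_chunk **chunks;
};

/* Map an object number to the chunk that holds its context entry. */
static int sketch_chunk_index(const struct sketch_table *t, int obj)
{
	return (obj & (t->num_obj - 1)) * t->obj_size / SKETCH_CHUNK_SIZE;
}

struct sketch_table *sketch_table_create(int obj_size, int num_obj)
{
	struct sketch_table *t = malloc(sizeof *t);

	if (!t)
		return NULL;
	t->num_obj = num_obj;
	t->obj_size = obj_size;
	t->num_chunks = (obj_size * num_obj + SKETCH_CHUNK_SIZE - 1) /
		SKETCH_CHUNK_SIZE;
	/* calloc leaves every slot empty, i.e. no chunk mapped yet. */
	t->chunks = calloc(t->num_chunks, sizeof *t->chunks);
	if (!t->chunks) {
		free(t);
		return NULL;
	}
	return t;
}

/* Take a reference on the chunk covering obj, allocating it on first use. */
int sketch_table_get(struct sketch_table *t, int obj)
{
	int i = sketch_chunk_index(t, obj);

	if (!t->chunks[i]) {
		struct sketch_chunk *c = malloc(sizeof *c);

		if (!c)
			return -1;
		c->mem = calloc(1, SKETCH_CHUNK_SIZE);
		if (!c->mem) {
			free(c);
			return -1;
		}
		c->refcount = 0;
		t->chunks[i] = c;
	}
	++t->chunks[i]->refcount;
	return 0;
}

/* Drop a reference; the chunk is released when the last user goes away. */
void sketch_table_put(struct sketch_table *t, int obj)
{
	int i = sketch_chunk_index(t, obj);
	struct sketch_chunk *c = t->chunks[i];

	assert(c && c->refcount > 0);
	if (--c->refcount == 0) {
		free(c->mem);
		free(c);
		t->chunks[i] = NULL;
	}
}
```

With 256-byte objects and 4096-byte chunks, objects 0 through 15 share chunk 0, so the chunk stays allocated until every one of them has been put. In the patch, chunks holding reserved firmware objects get an extra reference at table creation so that they are never freed.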
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.152732681 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.772598129 -0800 @@ -363,10 +363,9 @@ } mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, - mdev->limits.num_mtt_segs * init_hca->mtt_seg_sz, - mdev->limits.reserved_mtts * - init_hca->mtt_seg_sz, 1); + mdev->limits.num_mtt_segs, + mdev->limits.reserved_mtts, 1); if (!mdev->mr_table.mtt_table) { mthca_err(mdev, "Failed to map MTT context memory, aborting.\n"); err = -ENOMEM; @@ -374,10 +373,9 @@ } mdev->mr_table.mpt_table = mthca_alloc_icm_table(mdev, init_hca->mpt_base, - mdev->limits.num_mpts * dev_lim->mpt_entry_sz, - mdev->limits.reserved_mrws * - dev_lim->mpt_entry_sz, 1); + mdev->limits.num_mpts, + mdev->limits.reserved_mrws, 1); if (!mdev->mr_table.mpt_table) { mthca_err(mdev, "Failed to map MPT context memory, aborting.\n"); err = -ENOMEM; @@ -385,10 +383,9 @@ } mdev->qp_table.qp_table = mthca_alloc_icm_table(mdev, init_hca->qpc_base, - mdev->limits.num_qps * dev_lim->qpc_entry_sz, - mdev->limits.reserved_qps * - dev_lim->qpc_entry_sz, 1); + mdev->limits.num_qps, + mdev->limits.reserved_qps, 0); if (!mdev->qp_table.qp_table) { mthca_err(mdev, "Failed to map QP context memory, aborting.\n"); err = -ENOMEM; @@ -396,10 +393,9 @@ } mdev->qp_table.eqp_table = mthca_alloc_icm_table(mdev, init_hca->eqpc_base, - mdev->limits.num_qps * dev_lim->eqpc_entry_sz, - mdev->limits.reserved_qps * - dev_lim->eqpc_entry_sz, 1); + mdev->limits.num_qps, + mdev->limits.reserved_qps, 0); if (!mdev->qp_table.eqp_table) { mthca_err(mdev, "Failed to map EQP context memory, aborting.\n"); err = -ENOMEM; @@ -407,10 +403,9 @@ } mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base, - mdev->limits.num_cqs * dev_lim->cqc_entry_sz, - mdev->limits.reserved_cqs * - dev_lim->cqc_entry_sz, 1); + mdev->limits.num_cqs, + mdev->limits.reserved_cqs, 0); 
if (!mdev->cq_table.table) { mthca_err(mdev, "Failed to map CQ context memory, aborting.\n"); err = -ENOMEM; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-01-25 20:46:29.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-03 14:12:56.773597912 -0800 @@ -79,6 +79,7 @@ if (!icm) return icm; + icm->refcount = 0; INIT_LIST_HEAD(&icm->chunk_list); cur_order = get_order(MTHCA_ICM_ALLOC_SIZE); @@ -138,9 +139,62 @@ return NULL; } +int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj) +{ + int i = (obj & (table->num_obj - 1)) * table->obj_size / MTHCA_TABLE_CHUNK_SIZE; + int ret = 0; + u8 status; + + down(&table->mutex); + + if (table->icm[i]) { + ++table->icm[i]->refcount; + goto out; + } + + table->icm[i] = mthca_alloc_icm(dev, MTHCA_TABLE_CHUNK_SIZE >> PAGE_SHIFT, + (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) | + __GFP_NOWARN); + if (!table->icm[i]) { + ret = -ENOMEM; + goto out; + } + + if (mthca_MAP_ICM(dev, table->icm[i], table->virt + i * MTHCA_TABLE_CHUNK_SIZE, + &status) || status) { + mthca_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + ret = -ENOMEM; + goto out; + } + + ++table->icm[i]->refcount; + +out: + up(&table->mutex); + return ret; +} + +void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj) +{ + int i = (obj & (table->num_obj - 1)) * table->obj_size / MTHCA_TABLE_CHUNK_SIZE; + u8 status; + + down(&table->mutex); + + if (--table->icm[i]->refcount == 0) { + mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, + MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + mthca_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + } + + up(&table->mutex); +} + struct mthca_icm_table *mthca_alloc_icm_table(struct mthca_dev *dev, - u64 virt, unsigned size, - unsigned reserved, + u64 virt, int obj_size, + int nobj, int reserved, int use_lowmem) { struct mthca_icm_table *table; @@ -148,20 +202,23 @@ int i; u8 status; - num_icm = size / 
MTHCA_TABLE_CHUNK_SIZE; + num_icm = obj_size * nobj / MTHCA_TABLE_CHUNK_SIZE; table = kmalloc(sizeof *table + num_icm * sizeof *table->icm, GFP_KERNEL); if (!table) return NULL; - table->virt = virt; - table->num_icm = num_icm; - init_MUTEX(&table->sem); + table->virt = virt; + table->num_icm = num_icm; + table->num_obj = nobj; + table->obj_size = obj_size; + table->lowmem = use_lowmem; + init_MUTEX(&table->mutex); for (i = 0; i < num_icm; ++i) table->icm[i] = NULL; - for (i = 0; i < (reserved + MTHCA_TABLE_CHUNK_SIZE - 1) / MTHCA_TABLE_CHUNK_SIZE; ++i) { + for (i = 0; i * MTHCA_TABLE_CHUNK_SIZE < reserved * obj_size; ++i) { table->icm[i] = mthca_alloc_icm(dev, MTHCA_TABLE_CHUNK_SIZE >> PAGE_SHIFT, (use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) | __GFP_NOWARN); @@ -173,6 +230,12 @@ table->icm[i] = NULL; goto err; } + + /* + * Add a reference to this ICM chunk so that it never + * gets freed (since it contains reserved firmware objects). + */ + ++table->icm[i]->refcount; } return table; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-01-25 20:46:29.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-03 14:12:56.773597912 -0800 @@ -53,12 +53,16 @@ struct mthca_icm { struct list_head chunk_list; + int refcount; }; struct mthca_icm_table { u64 virt; int num_icm; - struct semaphore sem; + int num_obj; + int obj_size; + int lowmem; + struct semaphore mutex; struct mthca_icm *icm[0]; }; @@ -75,10 +79,12 @@ void mthca_free_icm(struct mthca_dev *dev, struct mthca_icm *icm); struct mthca_icm_table *mthca_alloc_icm_table(struct mthca_dev *dev, - u64 virt, unsigned size, - unsigned reserved, + u64 virt, int obj_size, + int nobj, int reserved, int use_lowmem); void mthca_free_icm_table(struct mthca_dev *dev, struct mthca_icm_table *table); +int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); +void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); static inline void 
mthca_icm_first(struct mthca_icm *icm, struct mthca_icm_iter *iter) From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][8/26] IB/mthca: add UAR allocation In-Reply-To: <2005331520.q2c4004P8DuwgJEx@topspin.com> Message-ID: <2005331520.qwFp6OBqldRd6oo8@topspin.com> Add support for allocating user access regions (UARs). Use this to allocate a region for the kernel at driver init instead of using the hard-coded MTHCA_KAR_PAGE index. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/Makefile 2005-01-15 15:16:40.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/Makefile 2005-03-03 14:12:56.155732030 -0800 @@ -9,4 +9,4 @@ ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ - mthca_provider.o mthca_memfree.o + mthca_provider.o mthca_memfree.o mthca_uar.o --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:53.538300187 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:56.153732464 -0800 @@ -666,7 +666,7 @@ MTHCA_CQ_FLAG_TR); cq_context->start = cpu_to_be64(0); cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | - MTHCA_KAR_PAGE); + dev->driver_uar.index); cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:55.515870922 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:56.152732681 -0800 @@ -65,7 +65,6 @@ }; enum { - MTHCA_KAR_PAGE = 1, MTHCA_MAX_PORTS = 2 }; @@ -108,6 +107,7 @@ int gid_table_len; int pkey_table_len; int local_ca_ack_delay; + int num_uars; int max_sg; int num_qps; int 
reserved_qps; @@ -148,6 +148,12 @@ } *page_list; }; +struct mthca_uar_table { + struct mthca_alloc alloc; + u64 uarc_base; + int uarc_size; +}; + struct mthca_pd_table { struct mthca_alloc alloc; }; @@ -252,6 +258,7 @@ struct mthca_cmd cmd; struct mthca_limits limits; + struct mthca_uar_table uar_table; struct mthca_pd_table pd_table; struct mthca_mr_table mr_table; struct mthca_eq_table eq_table; @@ -260,6 +267,7 @@ struct mthca_av_table av_table; struct mthca_mcg_table mcg_table; + struct mthca_uar driver_uar; struct mthca_pd driver_pd; struct mthca_mr driver_mr; @@ -318,6 +326,7 @@ int mthca_array_init(struct mthca_array *array, int nent); void mthca_array_cleanup(struct mthca_array *array, int nent); +int mthca_init_uar_table(struct mthca_dev *dev); int mthca_init_pd_table(struct mthca_dev *dev); int mthca_init_mr_table(struct mthca_dev *dev); int mthca_init_eq_table(struct mthca_dev *dev); @@ -326,6 +335,7 @@ int mthca_init_av_table(struct mthca_dev *dev); int mthca_init_mcg_table(struct mthca_dev *dev); +void mthca_cleanup_uar_table(struct mthca_dev *dev); void mthca_cleanup_pd_table(struct mthca_dev *dev); void mthca_cleanup_mr_table(struct mthca_dev *dev); void mthca_cleanup_eq_table(struct mthca_dev *dev); @@ -337,6 +347,9 @@ int mthca_register_device(struct mthca_dev *dev); void mthca_unregister_device(struct mthca_dev *dev); +int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar); +void mthca_uar_free(struct mthca_dev *dev, struct mthca_uar *uar); + int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:55.516870705 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:56.154732247 -0800 @@ -469,7 +469,7 @@ MTHCA_EQ_FLAG_TR); eq_context->start = cpu_to_be64(0); eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | - MTHCA_KAR_PAGE); + dev->driver_uar.index); 
eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); eq_context->intr = intr; eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:55.516870705 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.152732681 -0800 @@ -570,13 +570,35 @@ MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); - err = mthca_init_pd_table(dev); + err = mthca_init_uar_table(dev); if (err) { mthca_err(dev, "Failed to initialize " - "protection domain table, aborting.\n"); + "user access region table, aborting.\n"); return err; } + err = mthca_uar_alloc(dev, &dev->driver_uar); + if (err) { + mthca_err(dev, "Failed to allocate driver access region, " + "aborting.\n"); + goto err_uar_table_free; + } + + dev->kar = ioremap(dev->driver_uar.pfn << PAGE_SHIFT, PAGE_SIZE); + if (!dev->kar) { + mthca_err(dev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_uar_free; + } + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + goto err_kar_unmap; + } + err = mthca_init_mr_table(dev); if (err) { mthca_err(dev, "Failed to initialize " @@ -677,7 +699,16 @@ err_pd_table_free: mthca_cleanup_pd_table(dev); - return err; + +err_kar_unmap: + iounmap(dev->kar); + +err_uar_free: + mthca_uar_free(dev, &dev->driver_uar); + +err_uar_table_free: + mthca_cleanup_uar_table(dev); + return err; } static int __devinit mthca_request_regions(struct pci_dev *pdev, @@ -789,7 +820,6 @@ static int mthca_version_printed = 0; int ddr_hidden = 0; int err; - unsigned long mthca_base; struct mthca_dev *mdev; if (!mthca_version_printed) { @@ -891,8 +921,7 @@ sema_init(&mdev->cmd.poll_sem, 1); mdev->cmd.use_events = 0; - mthca_base = pci_resource_start(pdev, 0); - mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); + mdev->hcr = ioremap(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); if 
(!mdev->hcr) { mthca_err(mdev, "Couldn't map command register, " "aborting.\n"); @@ -900,22 +929,13 @@ goto err_free_dev; } - mthca_base = pci_resource_start(pdev, 2); - mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); - if (!mdev->kar) { - mthca_err(mdev, "Couldn't map kernel access region, " - "aborting.\n"); - err = -ENOMEM; - goto err_iounmap; - } - err = mthca_tune_pci(mdev); if (err) - goto err_iounmap_kar; + goto err_iounmap; err = mthca_init_hca(mdev); if (err) - goto err_iounmap_kar; + goto err_iounmap; err = mthca_setup_hca(mdev); if (err) @@ -948,13 +968,11 @@ mthca_cleanup_mr_table(mdev); mthca_cleanup_pd_table(mdev); + mthca_cleanup_uar_table(mdev); err_close: mthca_close_hca(mdev); -err_iounmap_kar: - iounmap(mdev->kar); - err_iounmap: iounmap(mdev->hcr); @@ -1000,9 +1018,12 @@ mthca_cleanup_mr_table(mdev); mthca_cleanup_pd_table(mdev); + iounmap(mdev->kar); + mthca_uar_free(mdev, &mdev->driver_uar); + mthca_cleanup_uar_table(mdev); + mthca_close_hca(mdev); - iounmap(mdev->kar); iounmap(mdev->hcr); if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-03 14:12:56.153732464 -0800 @@ -236,6 +236,7 @@ init_hca->mtt_seg_sz = ffs(dev_lim->mtt_seg_sz) - 7; break; case MTHCA_RES_UAR: + dev->limits.num_uars = profile[i].num; init_hca->uar_scratch_base = profile[i].start; break; case MTHCA_RES_UDAV: --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:54.674053653 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:56.153732464 -0800 @@ -49,6 +49,11 @@ DECLARE_PCI_UNMAP_ADDR(mapping) }; +struct mthca_uar { + unsigned long pfn; + int index; +}; + struct mthca_mr { struct ib_mr ibmr; int order; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:54.675053436 -0800 +++ 
linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:56.155732030 -0800 @@ -625,7 +625,7 @@ qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | (31 << 24)); } - qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->usr_page = cpu_to_be32(dev->driver_uar.index); qp_context->local_qpn = cpu_to_be32(qp->qpn); if (attr_mask & IB_QP_DEST_QPN) { qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/hw/mthca/mthca_uar.c 2005-03-03 14:12:56.152732681 -0800 @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include "mthca_dev.h" + +int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar) +{ + uar->index = mthca_alloc(&dev->uar_table.alloc); + if (uar->index == -1) + return -ENOMEM; + + uar->pfn = (pci_resource_start(dev->pdev, 2) >> PAGE_SHIFT) + uar->index; + + return 0; +} + +void mthca_uar_free(struct mthca_dev *dev, struct mthca_uar *uar) +{ + mthca_free(&dev->uar_table.alloc, uar->index); +} + +int mthca_init_uar_table(struct mthca_dev *dev) +{ + int ret; + + ret = mthca_alloc_init(&dev->uar_table.alloc, + dev->limits.num_uars, + dev->limits.num_uars - 1, + dev->limits.reserved_uars); + + return ret; +} + +void mthca_cleanup_uar_table(struct mthca_dev *dev) +{ + /* XXX check if any UARs are still allocated? */ + mthca_alloc_cleanup(&dev->uar_table.alloc); +} From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][10/26] IB/mthca: mem-free memory region support In-Reply-To: <2005331520.nfKPjEcWG6DlwOqo@topspin.com> Message-ID: <2005331520.6xlwh79w94Kl0EpH@topspin.com> Add support for mem-free mode to memory region code. This mostly amounts to properly munging between keys and indices. 
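The key/index munging mentioned above is an 8-bit rotation of the 32-bit value, as the hw_index_to_key() and key_to_hw_index() helpers in the diff below show. The following stand-alone sketch restates that rotation in user-space C so the round trip can be checked; the sketch_* names and the enum are hypothetical stand-ins for the driver's dev->hca_type test, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's dev->hca_type check. */
enum sketch_hca_type { SKETCH_TAVOR, SKETCH_ARBEL_NATIVE };

/* MPT index -> key: rotate left by 8 bits in mem-free mode, identity otherwise. */
uint32_t sketch_index_to_key(enum sketch_hca_type type, uint32_t ind)
{
	if (type == SKETCH_ARBEL_NATIVE)
		return (ind >> 24) | (ind << 8);
	return ind;
}

/* Key -> MPT index: the inverse rotation (right by 8 bits). */
uint32_t sketch_key_to_index(enum sketch_hca_type type, uint32_t key)
{
	if (type == SKETCH_ARBEL_NATIVE)
		return (key << 24) | (key >> 8);
	return key;
}

/* Small self-check: the two conversions undo each other. */
void sketch_key_selfcheck(void)
{
	uint32_t ind = 0x00000123;
	uint32_t key = sketch_index_to_key(SKETCH_ARBEL_NATIVE, ind);

	assert(key == 0x00012300);
	assert(sketch_key_to_index(SKETCH_ARBEL_NATIVE, key) == ind);
	assert(sketch_index_to_key(SKETCH_TAVOR, ind) == ind);
}
```

The patch hands the rotated value to consumers as lkey/rkey while passing the unrotated index to the MPT commands and to the allocator.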
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-01-15 15:16:11.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-03-03 14:12:57.165512841 -0800 @@ -53,7 +53,8 @@ u32 window_count; u32 window_count_limit; u64 mtt_seg; - u32 reserved[3]; + u32 mtt_sz; /* Arbel only */ + u32 reserved[2]; } __attribute__((packed)); #define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) @@ -121,21 +122,38 @@ spin_unlock(&dev->mr_table.mpt_alloc.lock); } +static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind) +{ + if (dev->hca_type == ARBEL_NATIVE) + return (ind >> 24) | (ind << 8); + else + return ind; +} + +static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key) +{ + if (dev->hca_type == ARBEL_NATIVE) + return (key << 24) | (key >> 8); + else + return key; +} + int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, u32 access, struct mthca_mr *mr) { void *mailbox; struct mthca_mpt_entry *mpt_entry; + u32 key; int err; u8 status; might_sleep(); mr->order = -1; - mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); - if (mr->ibmr.lkey == -1) + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) return -ENOMEM; - mr->ibmr.rkey = mr->ibmr.lkey; + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); @@ -151,7 +169,7 @@ MTHCA_MPT_FLAG_REGION | access); mpt_entry->page_size = 0; - mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->key = cpu_to_be32(key); mpt_entry->pd = cpu_to_be32(pd); mpt_entry->start = 0; mpt_entry->length = ~0ULL; @@ -160,7 +178,7 @@ sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); err = mthca_SW2HW_MPT(dev, mpt_entry, - mr->ibmr.lkey & (dev->limits.num_mpts - 1), + key & (dev->limits.num_mpts - 1), &status); if (err) mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); @@ -182,6 +200,7 @@ void *mailbox; u64 *mtt_entry; struct mthca_mpt_entry *mpt_entry; + u32 key; int err = 
-ENOMEM; u8 status; int i; @@ -189,10 +208,10 @@ might_sleep(); WARN_ON(buffer_size_shift >= 32); - mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); - if (mr->ibmr.lkey == -1) + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) return -ENOMEM; - mr->ibmr.rkey = mr->ibmr.lkey; + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; i < list_len; @@ -254,7 +273,7 @@ access); mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); - mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->key = cpu_to_be32(key); mpt_entry->pd = cpu_to_be32(pd); mpt_entry->start = cpu_to_be64(iova); mpt_entry->length = cpu_to_be64(total_size); @@ -275,7 +294,7 @@ } err = mthca_SW2HW_MPT(dev, mpt_entry, - mr->ibmr.lkey & (dev->limits.num_mpts - 1), + key & (dev->limits.num_mpts - 1), &status); if (err) mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); @@ -307,7 +326,8 @@ might_sleep(); err = mthca_HW2SW_MPT(dev, NULL, - mr->ibmr.lkey & (dev->limits.num_mpts - 1), + key_to_hw_index(dev, mr->ibmr.lkey) & + (dev->limits.num_mpts - 1), &status); if (err) mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); @@ -318,7 +338,7 @@ if (mr->order >= 0) mthca_free_mtt(dev, mr->first_seg, mr->order); - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); } int __devinit mthca_init_mr_table(struct mthca_dev *dev) From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][11/26] IB/mthca: mem-free EQ initialization In-Reply-To: <2005331520.6xlwh79w94Kl0EpH@topspin.com> Message-ID: <2005331520.nW52EhJhFo4sAhLI@topspin.com> Add code to initialize EQ context properly in both Tavor and mem-free mode. 
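To make the difference between the two modes concrete, the sketch below fills in a cut-down, hypothetical EQ context the way the diff that follows appears to: the log2 of the entry count always goes in the top byte of logsize_usrpage, while the UAR index and the location of the PD number depend on the HCA type. The struct, the enum and the sketch_* names are assumptions for illustration only, and the cpu_to_be32() byte swapping done by the real code is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's dev->hca_type check. */
enum sketch_eq_hca_type { SKETCH_EQ_TAVOR, SKETCH_EQ_ARBEL_NATIVE };

/* Cut-down EQ context holding only the fields discussed here (host byte order). */
struct sketch_eq_context {
	uint32_t logsize_usrpage;
	uint32_t tavor_pd;  /* reserved in mem-free mode */
	uint32_t arbel_pd;  /* lost_count in Tavor mode */
};

/* log2 of a power-of-two entry count; plays the role of ffs(nent) - 1. */
static uint32_t sketch_log2(uint32_t nent)
{
	uint32_t l = 0;

	while (nent > 1) {
		nent >>= 1;
		++l;
	}
	return l;
}

void sketch_fill_eq_context(struct sketch_eq_context *ctx,
			    enum sketch_eq_hca_type type, uint32_t nent,
			    uint32_t uar_index, uint32_t pd_num)
{
	ctx->tavor_pd = 0;
	ctx->arbel_pd = 0;
	ctx->logsize_usrpage = sketch_log2(nent) << 24;

	if (type == SKETCH_EQ_ARBEL_NATIVE) {
		/* Mem-free mode: no UAR index, PD in the later field. */
		ctx->arbel_pd = pd_num;
	} else {
		/* Tavor mode: UAR index in the low bits, PD in the earlier field. */
		ctx->logsize_usrpage |= uar_index;
		ctx->tavor_pd = pd_num;
	}
}
```

For a 128-entry EQ the log2 size is 7, so logsize_usrpage is 0x07000000 in mem-free mode and 0x07000001 in Tavor mode with UAR index 1.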
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:56.154732247 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:57.462448386 -0800 @@ -54,10 +54,10 @@ u32 flags; u64 start; u32 logsize_usrpage; - u32 pd; + u32 tavor_pd; /* reserved for Arbel */ u8 reserved1[3]; u8 intr; - u32 lost_count; + u32 arbel_pd; /* lost_count for Tavor */ u32 lkey; u32 reserved2[2]; u32 consumer_index; @@ -75,6 +75,7 @@ #define MTHCA_EQ_STATE_ARMED ( 1 << 8) #define MTHCA_EQ_STATE_FIRED ( 2 << 8) #define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) +#define MTHCA_EQ_STATE_ARBEL ( 8 << 8) enum { MTHCA_EVENT_TYPE_COMP = 0x00, @@ -467,10 +468,16 @@ MTHCA_EQ_OWNER_HW | MTHCA_EQ_STATE_ARMED | MTHCA_EQ_FLAG_TR); - eq_context->start = cpu_to_be64(0); - eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | - dev->driver_uar.index); - eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + if (dev->hca_type == ARBEL_NATIVE) + eq_context->flags |= cpu_to_be32(MTHCA_EQ_STATE_ARBEL); + + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24); + if (dev->hca_type == ARBEL_NATIVE) { + eq_context->arbel_pd = cpu_to_be32(dev->driver_pd.pd_num); + } else { + eq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index); + eq_context->tavor_pd = cpu_to_be32(dev->driver_pd.pd_num); + } eq_context->intr = intr; eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][12/26] IB/mthca: mem-free interrupt handling In-Reply-To: <2005331520.nW52EhJhFo4sAhLI@topspin.com> Message-ID: <2005331520.KR3jHRDWtXI3rzl6@topspin.com> Update interrupt handling code to handle mem-free mode. While we're at it, improve the Tavor interrupt handling to avoid an extra MMIO read of the event cause register. 
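One piece of bookkeeping in the diff that follows is the per-EQ eqn_mask (a byte-swapped single-bit mask) and the table-wide arm_mask that accumulates those bits, so that the mem-free INTA handler can re-arm every EQ with one register write. The sketch below reproduces only that mask arithmetic in user-space C; the sketch_* names are hypothetical, and sketch_swab32() is a plain reimplementation standing in for the kernel's swab32().

```c
#include <assert.h>
#include <stdint.h>

/* Unconditional 32-bit byte swap; stands in for the kernel's swab32(). */
static uint32_t sketch_swab32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Per-EQ mask, as in "eq->eqn_mask = swab32(1 << eq->eqn)". */
uint32_t sketch_eqn_mask(int eqn)
{
	return sketch_swab32((uint32_t)1 << eqn);
}

/* Hypothetical cut-down EQ table holding only the arm mask. */
struct sketch_eq_table {
	uint32_t arm_mask;
};

/* On EQ creation, add the EQ's bit to the set that gets armed. */
void sketch_eq_created(struct sketch_eq_table *tbl, int eqn)
{
	tbl->arm_mask |= sketch_eqn_mask(eqn);
}

/* On EQ teardown, remove the bit again. */
void sketch_eq_destroyed(struct sketch_eq_table *tbl, int eqn)
{
	tbl->arm_mask &= ~sketch_eqn_mask(eqn);
}
```

In the patch the same eqn_mask is also what the Tavor handler tests against the value read from the event cause register, which is why the field is renamed from ecr_mask.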
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:56.152732681 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:57.857362663 -0800 @@ -171,6 +171,7 @@ struct mthca_alloc alloc; void __iomem *clr_int; u32 clr_mask; + u32 arm_mask; struct mthca_eq eq[MTHCA_NUM_EQ]; u64 icm_virt; struct page *icm_page; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:57.462448386 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:57.859362229 -0800 @@ -165,19 +165,46 @@ MTHCA_ASYNC_EVENT_MASK; } -static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) +static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) { u32 doorbell[2]; doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn); doorbell[1] = cpu_to_be32(ci & (eq->nent - 1)); + /* + * This barrier makes sure that all updates to ownership bits + * done by set_eqe_hw() hit memory before the consumer index + * is updated. set_eq_ci() allows the HCA to possibly write + * more EQ entries, and we want to avoid the exceedingly + * unlikely possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); mthca_write64(doorbell, dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } -static inline void eq_req_not(struct mthca_dev *dev, int eqn) +static inline void arbel_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) +{ + /* See comment in tavor_set_eq_ci() above. 
*/ + wmb(); + __raw_writel(cpu_to_be32(ci), dev->eq_regs.arbel.eq_set_ci_base + + eq->eqn * 8); + /* We still want ordering, just not swabbing, so add a barrier */ + mb(); +} + +static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) +{ + if (dev->hca_type == ARBEL_NATIVE) + arbel_set_eq_ci(dev, eq, ci); + else + tavor_set_eq_ci(dev, eq, ci); +} + +static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn) { u32 doorbell[2]; @@ -189,16 +216,23 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } +static inline void arbel_eq_req_not(struct mthca_dev *dev, u32 eqn_mask) +{ + writel(eqn_mask, dev->eq_regs.arbel.eq_arm); +} + static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) { - u32 doorbell[2]; + if (dev->hca_type != ARBEL_NATIVE) { + u32 doorbell[2]; - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); - doorbell[1] = cpu_to_be32(cqn); + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); - mthca_write64(doorbell, - dev->kar + MTHCA_EQ_DOORBELL, - MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } } static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, u32 entry) @@ -233,7 +267,7 @@ ib_dispatch_event(&record); } -static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +static int mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) { struct mthca_eqe *eqe; int disarm_cqn; @@ -334,60 +368,93 @@ ++eq->cons_index; eqes_found = 1; - if (set_ci) { - wmb(); /* see comment below */ + if (unlikely(set_ci)) { + /* + * Conditional on hca_type is OK here because + * this is a rare case, not the fast path. + */ set_eq_ci(dev, eq, eq->cons_index); set_ci = 0; } } /* - * This barrier makes sure that all updates to - * ownership bits done by set_eqe_hw() hit memory - * before the consumer index is updated. 
set_eq_ci()
-	 * allows the HCA to possibly write more EQ entries,
-	 * and we want to avoid the exceedingly unlikely
-	 * possibility of the HCA writing an entry and then
-	 * having set_eqe_hw() overwrite the owner field.
+	 * Rely on caller to set consumer index so that we don't have
+	 * to test hca_type in our interrupt handling fast path.
 	 */
-	if (likely(eqes_found)) {
-		wmb();
-		set_eq_ci(dev, eq, eq->cons_index);
-	}
-	eq_req_not(dev, eq->eqn);
+	return eqes_found;
 }

-static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs)
+static irqreturn_t mthca_tavor_interrupt(int irq, void *dev_ptr, struct pt_regs *regs)
 {
 	struct mthca_dev *dev = dev_ptr;
 	u32 ecr;
-	int work = 0;
 	int i;

 	if (dev->eq_table.clr_mask)
 		writel(dev->eq_table.clr_mask, dev->eq_table.clr_int);

-	if ((ecr = readl(dev->eq_regs.tavor.ecr_base + 4)) != 0) {
-		work = 1;
-
+	ecr = readl(dev->eq_regs.tavor.ecr_base + 4);
+	if (ecr) {
 		writel(ecr, dev->eq_regs.tavor.ecr_base +
 		       MTHCA_ECR_CLR_BASE - MTHCA_ECR_BASE + 4);

 		for (i = 0; i < MTHCA_NUM_EQ; ++i)
-			if (ecr & dev->eq_table.eq[i].ecr_mask)
-				mthca_eq_int(dev, &dev->eq_table.eq[i]);
+			if (ecr & dev->eq_table.eq[i].eqn_mask &&
+			    mthca_eq_int(dev, &dev->eq_table.eq[i])) {
+				tavor_set_eq_ci(dev, &dev->eq_table.eq[i],
+						dev->eq_table.eq[i].cons_index);
+				tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn);
+			}
 	}

-	return IRQ_RETVAL(work);
+	return IRQ_RETVAL(ecr);
 }

-static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr,
+static irqreturn_t mthca_tavor_msi_x_interrupt(int irq, void *eq_ptr,
 					 struct pt_regs *regs)
 {
 	struct mthca_eq *eq = eq_ptr;
 	struct mthca_dev *dev = eq->dev;

 	mthca_eq_int(dev, eq);
+	tavor_set_eq_ci(dev, eq, eq->cons_index);
+	tavor_eq_req_not(dev, eq->eqn);
+
+	/* MSI-X vectors always belong to us */
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t mthca_arbel_interrupt(int irq, void *dev_ptr, struct pt_regs *regs)
+{
+	struct mthca_dev *dev = dev_ptr;
+	int work = 0;
+	int i;
+
+	if (dev->eq_table.clr_mask)
+		writel(dev->eq_table.clr_mask, dev->eq_table.clr_int);
+
+	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+		if (mthca_eq_int(dev, &dev->eq_table.eq[i])) {
+			work = 1;
+			arbel_set_eq_ci(dev, &dev->eq_table.eq[i],
+					dev->eq_table.eq[i].cons_index);
+		}
+
+	arbel_eq_req_not(dev, dev->eq_table.arm_mask);
+
+	return IRQ_RETVAL(work);
+}
+
+static irqreturn_t mthca_arbel_msi_x_interrupt(int irq, void *eq_ptr,
+					       struct pt_regs *regs)
+{
+	struct mthca_eq *eq = eq_ptr;
+	struct mthca_dev *dev = eq->dev;
+
+	mthca_eq_int(dev, eq);
+	arbel_set_eq_ci(dev, eq, eq->cons_index);
+	arbel_eq_req_not(dev, eq->eqn_mask);

 	/* MSI-X vectors always belong to us */
 	return IRQ_HANDLED;
@@ -496,10 +563,10 @@
 	kfree(dma_list);
 	kfree(mailbox);

-	eq->ecr_mask = swab32(1 << eq->eqn);
+	eq->eqn_mask = swab32(1 << eq->eqn);
 	eq->cons_index = 0;

-	eq_req_not(dev, eq->eqn);
+	dev->eq_table.arm_mask |= eq->eqn_mask;

 	mthca_dbg(dev, "Allocated EQ %d with %d entries\n",
 		  eq->eqn, nent);
@@ -551,6 +618,8 @@
 		mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n",
 			   status);

+	dev->eq_table.arm_mask &= ~eq->eqn_mask;
+
 	if (0) {
 		mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn);
 		for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) {
@@ -562,7 +631,6 @@
 		}
 	}

-	mthca_free_mr(dev, &eq->mr);
 	for (i = 0; i < npages; ++i)
 		pci_free_consistent(dev->pdev, PAGE_SIZE,
@@ -780,6 +848,8 @@
 			(dev->eq_table.inta_pin < 31 ? 4 : 0);
 	}

+	dev->eq_table.arm_mask = 0;
+
 	intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ?
 		128 : dev->eq_table.inta_pin;

@@ -810,15 +880,20 @@
 		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
 			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
-					  mthca_msi_x_interrupt, 0,
-					  eq_name[i], dev->eq_table.eq + i);
+					  dev->hca_type == ARBEL_NATIVE ?
+					  mthca_arbel_msi_x_interrupt :
+					  mthca_tavor_msi_x_interrupt,
+					  0, eq_name[i], dev->eq_table.eq + i);
 			if (err)
 				goto err_out_cmd;
 			dev->eq_table.eq[i].have_irq = 1;
 		}
 	} else {
-		err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ,
-				  DRV_NAME, dev);
+		err = request_irq(dev->pdev->irq,
+				  dev->hca_type == ARBEL_NATIVE ?
+				  mthca_arbel_interrupt :
+				  mthca_tavor_interrupt,
+				  SA_SHIRQ, DRV_NAME, dev);
 		if (err)
 			goto err_out_cmd;
 		dev->eq_table.have_irq = 1;
@@ -842,6 +917,12 @@
 		mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n",
 			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status);

+	for (i = 0; i < MTHCA_EQ_CMD; ++i)
+		if (dev->hca_type == ARBEL_NATIVE)
+			arbel_eq_req_not(dev, dev->eq_table.eq[i].eqn_mask);
+		else
+			tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn);
+
 	return 0;

 err_out_cmd:
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.772598129 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:57.858362446 -0800
@@ -608,13 +608,6 @@
 		goto err_mr_table_free;
 	}

-	if (dev->hca_type == ARBEL_NATIVE) {
-		mthca_warn(dev, "Sorry, native MT25208 mode support is not done, "
-			   "aborting.\n");
-		err = -ENODEV;
-		goto err_pd_free;
-	}
-
 	err = mthca_init_eq_table(dev);
 	if (err) {
 		mthca_err(dev, "Failed to initialize "
@@ -638,8 +631,16 @@
 			mthca_err(dev, "BIOS or ACPI interrupt routing problem?\n");

 		goto err_cmd_poll;
-	} else
-		mthca_dbg(dev, "NOP command IRQ test passed\n");
+	}
+
+	mthca_dbg(dev, "NOP command IRQ test passed\n");
+
+	if (dev->hca_type == ARBEL_NATIVE) {
+		mthca_warn(dev, "Sorry, native MT25208 mode support is not complete, "
+			   "aborting.\n");
+		err = -ENODEV;
+		goto err_cmd_poll;
+	}

 	err = mthca_init_cq_table(dev);
 	if (err) {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:56.153732464 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:57.858362446 -0800
@@ -70,7 +70,7 @@
 struct mthca_eq {
 	struct mthca_dev *dev;
 	int eqn;
-	u32 ecr_mask;
+	u32 eqn_mask;
 	u32 cons_index;
 	u16 msi_x_vector;
 	u16 msi_x_entry;

From roland at topspin.com Thu Mar 3 15:20:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:27 -0800
Subject: [openib-general] [PATCH][13/26] IB/mthca: tweak firmware command debug messages
In-Reply-To: <2005331520.KR3jHRDWtXI3rzl6@topspin.com>
Message-ID: <2005331520.wYWJriF1rMlIY4lJ@topspin.com>

Slightly improve debugging output for UNMAP_ICM and MODIFY_QP firmware
commands.

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-01-25 20:48:02.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-03 14:12:58.283270213 -0800
@@ -1305,6 +1305,9 @@

 int mthca_UNMAP_ICM(struct mthca_dev *dev, u64 virt, u32 page_count, u8 *status)
 {
+	mthca_dbg(dev, "Unmapping %d pages at %llx from ICM.\n",
+		  page_count, (unsigned long long) virt);
+
 	return mthca_cmd(dev, virt, page_count, 0, CMD_UNMAP_ICM, CMD_TIME_CLASS_B, status);
 }

@@ -1538,10 +1541,10 @@
 	if (0) {
 		int i;
 		mthca_dbg(dev, "Dumping QP context:\n");
-		printk(" %08x\n", be32_to_cpup(qp_context));
+		printk(" opt param mask: %08x\n", be32_to_cpup(qp_context));
 		for (i = 0; i < 0x100 / 4; ++i) {
 			if (i % 8 == 0)
-				printk("[%02x] ", i * 4);
+				printk(" [%02x] ", i * 4);
 			printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2]));
 			if ((i + 1) % 8 == 0)
 				printk("\n");

From roland at topspin.com Thu Mar 3 15:20:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:27 -0800
Subject: [openib-general] [PATCH][14/26] IB/mthca: tweak MAP_ICM_page firmware command
In-Reply-To: <2005331520.wYWJriF1rMlIY4lJ@topspin.com>
Message-ID: <2005331520.6eBThkRRWYJ5HE5s@topspin.com>

Have the MAP_ICM_page() firmware command assume pages are always the
HCA-native 4K size rather than using the kernel's page size. This will
make handling doorbell pages for mem-free mode simpler.
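
A minimal sketch of what this change means for the second MAP_ICM inbox word, assuming (from the line the patch removes) that the low bits carried log2(page size) - 12. The helper names and the sample address are made up for illustration and are not driver code; the big-endian conversion done by cpu_to_be64() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: old encoding, which folded the kernel page
 * shift (relative to 4K) into the low bits of the DMA address word. */
static uint64_t icm_word_old(uint64_t dma_addr, unsigned page_shift)
{
	return dma_addr | (uint64_t)(page_shift - 12);
}

/* Hypothetical helper: new encoding, which always describes one 4K page,
 * so the low bits stay zero. */
static uint64_t icm_word_new(uint64_t dma_addr)
{
	return dma_addr;
}
```

With 4K kernel pages (page shift 12) the two encodings are identical. They differ only where the kernel page is larger: a page shift of 16, for example, would have set the low bits to 4 in the old form.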

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-03 14:12:58.283270213 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-03 14:12:58.619197294 -0800
@@ -1290,7 +1290,7 @@
 		return -ENOMEM;

 	inbox[0] = cpu_to_be64(virt);
-	inbox[1] = cpu_to_be64(dma_addr | (PAGE_SHIFT - 12));
+	inbox[1] = cpu_to_be64(dma_addr);

 	err = mthca_cmd(dev, indma, 1, 0, CMD_MAP_ICM, CMD_TIME_CLASS_B, status);


From roland at topspin.com Thu Mar 3 15:20:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:27 -0800
Subject: [openib-general] [PATCH][15/26] IB/mthca: mem-free doorbell record allocation
In-Reply-To: <2005331520.6eBThkRRWYJ5HE5s@topspin.com>
Message-ID: <2005331520.dH2BeQ6Ko7h8SaKM@topspin.com>

Mem-free mode requires the driver to allocate additional doorbell pages
for each user access region. Add support for this in mthca_memfree.c,
and have the driver allocate a table in db_tab for kernel use.

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:57.857362663 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:59.077097900 -0800
@@ -268,9 +268,10 @@
 	struct mthca_av_table av_table;
 	struct mthca_mcg_table mcg_table;

-	struct mthca_uar driver_uar;
-	struct mthca_pd driver_pd;
-	struct mthca_mr driver_mr;
+	struct mthca_uar driver_uar;
+	struct mthca_db_table *db_tab;
+	struct mthca_pd driver_pd;
+	struct mthca_mr driver_mr;

 	struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2];
 	struct ib_ah *sm_ah[MTHCA_MAX_PORTS];
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-03 14:12:56.773597912 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-03 14:12:59.079097466 -0800
@@ -267,3 +267,199 @@

 	kfree(table);
 }
+
+static u64 mthca_uarc_virt(struct mthca_dev *dev, int page)
+{
+	return dev->uar_table.uarc_base +
+		dev->driver_uar.index * dev->uar_table.uarc_size +
+		page * 4096;
+}
+
+int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db)
+{
+	int group;
+	int start, end, dir;
+	int i, j;
+	struct mthca_db_page *page;
+	int ret = 0;
+	u8 status;
+
+	down(&dev->db_tab->mutex);
+
+	switch (type) {
+	case MTHCA_DB_TYPE_CQ_ARM:
+	case MTHCA_DB_TYPE_SQ:
+		group = 0;
+		start = 0;
+		end = dev->db_tab->max_group1;
+		dir = 1;
+		break;
+
+	case MTHCA_DB_TYPE_CQ_SET_CI:
+	case MTHCA_DB_TYPE_RQ:
+	case MTHCA_DB_TYPE_SRQ:
+		group = 1;
+		start = dev->db_tab->npages - 1;
+		end = dev->db_tab->min_group2;
+		dir = -1;
+		break;
+
+	default:
+		return -1;
+	}
+
+	for (i = start; i != end; i += dir)
+		if (dev->db_tab->page[i].db_rec &&
+		    !bitmap_full(dev->db_tab->page[i].used,
+				 MTHCA_DB_REC_PER_PAGE)) {
+			page = dev->db_tab->page + i;
+			goto found;
+		}
+
+	if (dev->db_tab->max_group1 >= dev->db_tab->min_group2 - 1) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	page = dev->db_tab->page + end;
+	page->db_rec = dma_alloc_coherent(&dev->pdev->dev, 4096,
+					  &page->mapping, GFP_KERNEL);
+	if (!page->db_rec) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	memset(page->db_rec, 0, 4096);
+
+	ret = mthca_MAP_ICM_page(dev, page->mapping, mthca_uarc_virt(dev, i), &status);
+	if (!ret && status)
+		ret = -EINVAL;
+	if (ret) {
+		dma_free_coherent(&dev->pdev->dev, 4096,
+				  page->db_rec, page->mapping);
+		goto out;
+	}
+
+	bitmap_zero(page->used, MTHCA_DB_REC_PER_PAGE);
+	if (group == 0)
+		++dev->db_tab->max_group1;
+	else
+		--dev->db_tab->min_group2;
+
+found:
+	j = find_first_zero_bit(page->used, MTHCA_DB_REC_PER_PAGE);
+	set_bit(j, page->used);
+
+	if (group == 1)
+		j = MTHCA_DB_REC_PER_PAGE - 1 - j;
+
+	ret = i * MTHCA_DB_REC_PER_PAGE + j;
+
+	page->db_rec[j] = cpu_to_be64((qn << 8) | (type << 5));
+
+	*db = (u32 *) &page->db_rec[j];
+
+out:
+	up(&dev->db_tab->mutex);
+
+	return ret;
+}
+
+void mthca_free_db(struct mthca_dev *dev, int type, int db_index)
+{
+	int i, j;
+	struct mthca_db_page *page;
+	u8 status;
+
+	i = db_index / MTHCA_DB_REC_PER_PAGE;
+	j = db_index % MTHCA_DB_REC_PER_PAGE;
+
+	page = dev->db_tab->page + i;
+
+	down(&dev->db_tab->mutex);
+
+	page->db_rec[j] = 0;
+	if (i >= dev->db_tab->min_group2)
+		j = MTHCA_DB_REC_PER_PAGE - 1 - j;
+	clear_bit(j, page->used);
+
+	if (bitmap_empty(page->used, MTHCA_DB_REC_PER_PAGE) &&
+	    i >= dev->db_tab->max_group1 - 1) {
+		mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status);
+
+		dma_free_coherent(&dev->pdev->dev, 4096,
+				  page->db_rec, page->mapping);
+		page->db_rec = NULL;
+
+		if (i == dev->db_tab->max_group1) {
+			--dev->db_tab->max_group1;
+			/* XXX may be able to unmap more pages now */
+		}
+		if (i == dev->db_tab->min_group2)
+			++dev->db_tab->min_group2;
+	}
+
+	up(&dev->db_tab->mutex);
+}
+
+int mthca_init_db_tab(struct mthca_dev *dev)
+{
+	int i;
+
+	if (dev->hca_type != ARBEL_NATIVE)
+		return 0;
+
+	dev->db_tab = kmalloc(sizeof *dev->db_tab, GFP_KERNEL);
+	if (!dev->db_tab)
+		return -ENOMEM;
+
+	init_MUTEX(&dev->db_tab->mutex);
+
+	dev->db_tab->npages = dev->uar_table.uarc_size / PAGE_SIZE;
+	dev->db_tab->max_group1 = 0;
+	dev->db_tab->min_group2 = dev->db_tab->npages - 1;
+
+	dev->db_tab->page = kmalloc(dev->db_tab->npages *
+				    sizeof *dev->db_tab->page,
+				    GFP_KERNEL);
+	if (!dev->db_tab->page) {
+		kfree(dev->db_tab);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < dev->db_tab->npages; ++i)
+		dev->db_tab->page[i].db_rec = NULL;
+
+	return 0;
+}
+
+void mthca_cleanup_db_tab(struct mthca_dev *dev)
+{
+	int i;
+	u8 status;
+
+	if (dev->hca_type != ARBEL_NATIVE)
+		return;
+
+	/*
+	 * Because we don't always free our UARC pages when they
+	 * become empty to make mthca_free_db() simpler we need to
+	 * make a sweep through the doorbell pages and free any
+	 * leftover pages now.
+	 */
+	for (i = 0; i < dev->db_tab->npages; ++i) {
+		if (!dev->db_tab->page[i].db_rec)
+			continue;
+
+		if (!bitmap_empty(dev->db_tab->page[i].used, MTHCA_DB_REC_PER_PAGE))
+			mthca_warn(dev, "Kernel UARC page %d not empty\n", i);
+
+		mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status);
+
+		dma_free_coherent(&dev->pdev->dev, 4096,
+				  dev->db_tab->page[i].db_rec,
+				  dev->db_tab->page[i].mapping);
+	}
+
+	kfree(dev->db_tab->page);
+	kfree(dev->db_tab);
+}
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-03 14:12:56.773597912 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-03 14:12:59.078097683 -0800
@@ -125,4 +125,37 @@
 	return sg_dma_len(&iter->chunk->mem[iter->page_idx]);
 }

+enum {
+	MTHCA_DB_REC_PER_PAGE = 4096 / 8
+};
+
+struct mthca_db_page {
+	DECLARE_BITMAP(used, MTHCA_DB_REC_PER_PAGE);
+	u64 *db_rec;
+	dma_addr_t mapping;
+};
+
+struct mthca_db_table {
+	int npages;
+	int max_group1;
+	int min_group2;
+	struct mthca_db_page *page;
+	struct semaphore mutex;
+};
+
+enum {
+	MTHCA_DB_TYPE_INVALID = 0x0,
+	MTHCA_DB_TYPE_CQ_SET_CI = 0x1,
+	MTHCA_DB_TYPE_CQ_ARM = 0x2,
+	MTHCA_DB_TYPE_SQ = 0x3,
+	MTHCA_DB_TYPE_RQ = 0x4,
+	MTHCA_DB_TYPE_SRQ = 0x5,
+	MTHCA_DB_TYPE_GROUP_SEP = 0x7
+};
+
+int mthca_init_db_tab(struct mthca_dev *dev);
+void mthca_cleanup_db_tab(struct mthca_dev *dev);
+int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db);
+void mthca_free_db(struct mthca_dev *dev, int type, int db_index);
+
 #endif /* MTHCA_MEMFREE_H */
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-03 14:12:56.153732464 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-03 14:12:59.078097683 -0800
@@ -244,9 +244,11 @@
 			dev->av_table.num_ddr_avs = profile[i].num;
 			break;
 		case MTHCA_RES_UARC:
-			init_hca->uarc_base = profile[i].start;
-			init_hca->log_uarc_sz = ffs(request->uarc_size) - 13;
-			init_hca->log_uar_sz = ffs(request->num_uar) - 1;
+			dev->uar_table.uarc_size = request->uarc_size;
+			dev->uar_table.uarc_base = profile[i].start;
+			init_hca->uarc_base = profile[i].start;
+			init_hca->log_uarc_sz = ffs(request->uarc_size) - 13;
+			init_hca->log_uar_sz = ffs(request->num_uar) - 1;
 			break;
 		default:
 			break;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_uar.c 2005-03-03 14:12:56.152732681 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_uar.c 2005-03-03 14:12:59.078097683 -0800
@@ -33,6 +33,7 @@
  */

 #include "mthca_dev.h"
+#include "mthca_memfree.h"

 int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar)
 {
@@ -58,12 +59,20 @@
 			       dev->limits.num_uars,
 			       dev->limits.num_uars - 1,
 			       dev->limits.reserved_uars);
+	if (ret)
+		return ret;
+
+	ret = mthca_init_db_tab(dev);
+	if (ret)
+		mthca_alloc_cleanup(&dev->uar_table.alloc);

 	return ret;
 }

 void mthca_cleanup_uar_table(struct mthca_dev *dev)
 {
+	mthca_cleanup_db_tab(dev);
+
 	/* XXX check if any UARs are still allocated? */
 	mthca_alloc_cleanup(&dev->uar_table.alloc);
 }

From roland at topspin.com Thu Mar 3 15:20:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:27 -0800
Subject: [openib-general] [PATCH][17/26] IB/mthca: refactor CQ buffer allocate/free
In-Reply-To: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com>
Message-ID: <2005331520.tIEkOrHOmFDGvQZK@topspin.com>

Factor the allocation and freeing of completion queue buffers into
mthca_alloc_cq_buf() and mthca_free_cq_buf(). This makes the code more
readable and will eventually make handling userspace CQs simpler (the
kernel doesn't have to allocate a buffer at all).

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:56.153732464 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:59.925913650 -0800
@@ -557,32 +557,40 @@
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }

-int mthca_init_cq(struct mthca_dev *dev, int nent,
-		  struct mthca_cq *cq)
+static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq)
 {
-	int size = nent * MTHCA_CQ_ENTRY_SIZE;
-	dma_addr_t t;
-	void *mailbox = NULL;
-	int npages, shift;
-	u64 *dma_list = NULL;
-	struct mthca_cq_context *cq_context;
-	int err = -ENOMEM;
-	u8 status;
 	int i;
+	int size;

-	might_sleep();
+	if (cq->is_direct)
+		pci_free_consistent(dev->pdev,
+				    (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE,
+				    cq->queue.direct.buf,
+				    pci_unmap_addr(&cq->queue.direct,
+						   mapping));
+	else {
+		size = (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE;
+		for (i = 0; i < (size + PAGE_SIZE - 1) / PAGE_SIZE; ++i)
+			if (cq->queue.page_list[i].buf)
+				pci_free_consistent(dev->pdev, PAGE_SIZE,
+						    cq->queue.page_list[i].buf,
+						    pci_unmap_addr(&cq->queue.page_list[i],
+								   mapping));

-	mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA,
-			  GFP_KERNEL);
-	if (!mailbox)
-		goto err_out;
+		kfree(cq->queue.page_list);
+	}
+}

-	cq_context = MAILBOX_ALIGN(mailbox);
+static int mthca_alloc_cq_buf(struct mthca_dev *dev, int size,
+			      struct mthca_cq *cq)
+{
+	int err = -ENOMEM;
+	int npages, shift;
+	u64 *dma_list = NULL;
+	dma_addr_t t;
+	int i;

 	if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) {
-		if (0)
-			mthca_dbg(dev, "Creating direct CQ of size %d\n", size);
-
 		cq->is_direct = 1;
 		npages = 1;
 		shift = get_order(size) + PAGE_SHIFT;
@@ -590,7 +598,7 @@
 		cq->queue.direct.buf = pci_alloc_consistent(dev->pdev,
 							    size, &t);
 		if (!cq->queue.direct.buf)
-			goto err_out;
+			return -ENOMEM;

 		pci_unmap_addr_set(&cq->queue.direct, mapping, t);

@@ -603,7 +611,7 @@

 		dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL);
 		if (!dma_list)
-			goto err_out_free;
+			goto err_free;

 		for (i = 0; i < npages; ++i)
 			dma_list[i] = t + i * (1 << shift);
@@ -612,12 +620,9 @@
 		npages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
 		shift = PAGE_SHIFT;

-		if (0)
-			mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages);
-
 		dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL);
 		if (!dma_list)
-			goto err_out;
+			return -ENOMEM;

 		cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list,
 					      GFP_KERNEL);
@@ -631,7 +636,7 @@
 			cq->queue.page_list[i].buf =
 				pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t);
 			if (!cq->queue.page_list[i].buf)
-				goto err_out_free;
+				goto err_free;

 			dma_list[i] = t;
 			pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t);
@@ -640,13 +645,6 @@
 		}
 	}

-	for (i = 0; i < nent; ++i)
-		set_cqe_hw(get_cqe(cq, i));
-
-	cq->cqn = mthca_alloc(&dev->cq_table.alloc);
-	if (cq->cqn == -1)
-		goto err_out_free;
-
 	err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num,
 				  dma_list, shift, npages,
 				  0, size,
@@ -654,7 +652,52 @@
 				  MTHCA_MPT_FLAG_LOCAL_READ,
 				  &cq->mr);
 	if (err)
-		goto err_out_free_cq;
+		goto err_free;
+
+	kfree(dma_list);
+
+	return 0;
+
+err_free:
+	mthca_free_cq_buf(dev, cq);
+
+err_out:
+	kfree(dma_list);
+
+	return err;
+}
+
+int mthca_init_cq(struct mthca_dev *dev, int nent,
+		  struct mthca_cq *cq)
+{
+	int size = nent * MTHCA_CQ_ENTRY_SIZE;
+	void *mailbox = NULL;
+	struct mthca_cq_context *cq_context;
+	int err = -ENOMEM;
+	u8 status;
+	int i;
+
+	might_sleep();
+
+	cq->ibcq.cqe = nent - 1;
+
+	cq->cqn = mthca_alloc(&dev->cq_table.alloc);
+	if (cq->cqn == -1)
+		return -ENOMEM;
+
+	mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA,
+			  GFP_KERNEL);
+	if (!mailbox)
+		goto err_out;
+
+	cq_context = MAILBOX_ALIGN(mailbox);
+
+	err = mthca_alloc_cq_buf(dev, size, cq);
+	if (err)
+		goto err_out_mailbox;
+
+	for (i = 0; i < nent; ++i)
+		set_cqe_hw(get_cqe(cq, i));

 	spin_lock_init(&cq->lock);
 	atomic_set(&cq->refcount, 1);
@@ -697,37 +740,20 @@

 	cq->cons_index = 0;

-	kfree(dma_list);
 	kfree(mailbox);

 	return 0;

- err_out_free_mr:
+err_out_free_mr:
 	mthca_free_mr(dev, &cq->mr);
+	mthca_free_cq_buf(dev, cq);

- err_out_free_cq:
-	mthca_free(&dev->cq_table.alloc, cq->cqn);
-
- err_out_free:
-	if (cq->is_direct)
-		pci_free_consistent(dev->pdev, size,
-				    cq->queue.direct.buf,
-				    pci_unmap_addr(&cq->queue.direct, mapping));
-	else {
-		for (i = 0; i < npages; ++i)
-			if (cq->queue.page_list[i].buf)
-				pci_free_consistent(dev->pdev, PAGE_SIZE,
-						    cq->queue.page_list[i].buf,
-						    pci_unmap_addr(&cq->queue.page_list[i],
-								   mapping));
-
-		kfree(cq->queue.page_list);
-	}
-
- err_out:
-	kfree(dma_list);
+err_out_mailbox:
 	kfree(mailbox);

+err_out:
+	mthca_free(&dev->cq_table.alloc, cq->cqn);
+
 	return err;
 }

@@ -778,27 +804,7 @@
 	wait_event(cq->wait, !atomic_read(&cq->refcount));

 	mthca_free_mr(dev, &cq->mr);
-
-	if (cq->is_direct)
-		pci_free_consistent(dev->pdev,
-				    (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE,
-				    cq->queue.direct.buf,
-				    pci_unmap_addr(&cq->queue.direct,
-						   mapping));
-	else {
-		int i;
-
-		for (i = 0;
-		     i < ((cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) /
-			     PAGE_SIZE;
-		     ++i)
-			pci_free_consistent(dev->pdev, PAGE_SIZE,
-					    cq->queue.page_list[i].buf,
-					    pci_unmap_addr(&cq->queue.page_list[i],
-							   mapping));
-
-		kfree(cq->queue.page_list);
-	}
+	mthca_free_cq_buf(dev, cq);

 	mthca_free(&dev->cq_table.alloc, cq->cqn);
 	kfree(mailbox);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:54.673053870 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:59.925913650 -0800
@@ -408,8 +408,7 @@
 	if (err) {
 		kfree(cq);
 		cq = ERR_PTR(err);
-	} else
-		cq->ibcq.cqe = nent - 1;
+	}

 	return &cq->ibcq;
 }

From roland at topspin.com Thu Mar 3 15:20:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:27 -0800
Subject: [openib-general] [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <2005331520.dH2BeQ6Ko7h8SaKM@topspin.com>
Message-ID: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com>

Add a mthca_write_db_rec() to wrap writing doorbell records. On 64-bit
archs, this is just a 64-bit write, while on 32-bit archs it splits the
write into two 32-bit writes with a memory barrier to make sure the two
halves of the record are written in the correct order.

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h 2005-01-25 20:49:05.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_doorbell.h 2005-03-03 14:12:59.570990692 -0800
@@ -57,6 +57,11 @@
 	__raw_writeq(*(u64 *) val, dest);
 }

+static inline void mthca_write_db_rec(u32 val[2], u32 *db)
+{
+	*(u64 *) db = *(u64 *) val;
+}
+
 #else

 /*
@@ -80,4 +85,11 @@
 	spin_unlock_irqrestore(doorbell_lock, flags);
 }

+static inline void mthca_write_db_rec(u32 val[2], u32 *db)
+{
+	db[0] = val[0];
+	wmb();
+	db[1] = val[1];
+}
+
 #endif

From roland at topspin.com Thu Mar 3 15:20:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:28 -0800
Subject: [openib-general] [PATCH][18/26] IB/mthca: mem-free CQ initialization
In-Reply-To: <2005331520.tIEkOrHOmFDGvQZK@topspin.com>
Message-ID: <2005331520.xvxJqi7Nfv5UdZpQ@topspin.com>

Update CQ initialization and cleanup to handle mem-free mode: we need
to make sure the HCA has memory mapped for the entry in the CQ context
table we will use and also allocate doorbell records.
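
For readers following the doorbell record allocation this patch relies on, here is a small user-space sketch of the arithmetic used by mthca_alloc_db() and mthca_free_db() in patch 15/26: 512 eight-byte records per 4K page, an index of page * 512 + slot, and an initial record value of (qn << 8) | (type << 5). The function names are hypothetical, and the big-endian conversion is left out.

```c
#include <assert.h>
#include <stdint.h>

enum { DB_REC_PER_PAGE = 4096 / 8 };	/* same value as MTHCA_DB_REC_PER_PAGE */

/* Hypothetical helper: combine a page number and a slot into one index,
 * the way mthca_alloc_db() computes its return value. */
static int db_index(int page, int slot)
{
	return page * DB_REC_PER_PAGE + slot;
}

/* Hypothetical helpers: split an index back apart, as mthca_free_db() does. */
static int db_page(int index) { return index / DB_REC_PER_PAGE; }
static int db_slot(int index) { return index % DB_REC_PER_PAGE; }

/* Initial record contents before byte swapping: queue number and type,
 * computed in 32 bits as in the patch. */
static uint64_t db_initial(uint32_t qn, uint32_t type)
{
	return (qn << 8) | (type << 5);
}
```

Because a single integer carries both the page and the slot, the CQ only needs to remember set_ci_db_index and arm_db_index in order to free its records later.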

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:59.925913650 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:00.312829664 -0800
@@ -39,6 +39,7 @@

 #include "mthca_dev.h"
 #include "mthca_cmd.h"
+#include "mthca_memfree.h"

 enum {
 	MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE
@@ -55,7 +56,7 @@
 	u32 flags;
 	u64 start;
 	u32 logsize_usrpage;
-	u32 error_eqn;
+	u32 error_eqn; /* Tavor only */
 	u32 comp_eqn;
 	u32 pd;
 	u32 lkey;
@@ -64,7 +65,9 @@
 	u32 consumer_index;
 	u32 producer_index;
 	u32 cqn;
-	u32 reserved[3];
+	u32 ci_db; /* Arbel only */
+	u32 state_db; /* Arbel only */
+	u32 reserved;
 } __attribute__((packed));

 #define MTHCA_CQ_STATUS_OK ( 0 << 28)
@@ -685,10 +688,30 @@
 	if (cq->cqn == -1)
 		return -ENOMEM;

+	if (dev->hca_type == ARBEL_NATIVE) {
+		cq->arm_sn = 1;
+
+		err = mthca_table_get(dev, dev->cq_table.table, cq->cqn);
+		if (err)
+			goto err_out;
+
+		err = -ENOMEM;
+
+		cq->set_ci_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_SET_CI,
+						     cq->cqn, &cq->set_ci_db);
+		if (cq->set_ci_db_index < 0)
+			goto err_out_icm;
+
+		cq->arm_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_ARM,
+						  cq->cqn, &cq->arm_db);
+		if (cq->arm_db_index < 0)
+			goto err_out_ci;
+	}
+
 	mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA,
 			  GFP_KERNEL);
 	if (!mailbox)
-		goto err_out;
+		goto err_out_mailbox;

 	cq_context = MAILBOX_ALIGN(mailbox);

@@ -716,6 +739,11 @@
 	cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey);
 	cq_context->cqn = cpu_to_be32(cq->cqn);

+	if (dev->hca_type == ARBEL_NATIVE) {
+		cq_context->ci_db = cpu_to_be32(cq->set_ci_db_index);
+		cq_context->state_db = cpu_to_be32(cq->arm_db_index);
+	}
+
 	err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status);
 	if (err) {
 		mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err);
@@ -751,6 +779,14 @@
 err_out_mailbox:
 	kfree(mailbox);

+	mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index);
+
+err_out_ci:
+	mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
+
+err_out_icm:
+	mthca_table_put(dev, dev->cq_table.table, cq->cqn);
+
 err_out:
 	mthca_free(&dev->cq_table.alloc, cq->cqn);

@@ -806,6 +842,12 @@
 	mthca_free_mr(dev, &cq->mr);
 	mthca_free_cq_buf(dev, cq);

+	if (dev->hca_type == ARBEL_NATIVE) {
+		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index);
+		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
+		mthca_table_put(dev, dev->cq_table.table, cq->cqn);
+	}
+
 	mthca_free(&dev->cq_table.alloc, cq->cqn);
 	kfree(mailbox);
 }
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:57.858362446 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:00.312829664 -0800
@@ -143,6 +143,14 @@
 	int cqn;
 	int cons_index;
 	int is_direct;
+
+	/* Next fields are Arbel only */
+	int set_ci_db_index;
+	u32 *set_ci_db;
+	int arm_db_index;
+	u32 *arm_db;
+	int arm_sn;
+
 	union {
 		struct mthca_buf_list direct;
 		struct mthca_buf_list *page_list;

From roland at topspin.com Thu Mar 3 15:20:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:28 -0800
Subject: [openib-general] [PATCH][19/26] IB/mthca: mem-free CQ operations
In-Reply-To: <2005331520.xvxJqi7Nfv5UdZpQ@topspin.com>
Message-ID: <2005331520.VEavoMG964z0bUT1@topspin.com>

Add support for CQ data path operations (request notification, update
consumer index) in mem-free mode.
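
The consumer index handling in this patch can be sketched in plain C, assuming the CQ size is a power of two so that ibcq.cqe (nent - 1) works as a mask. The index runs free as a u32 and is masked only when a CQE is looked up, as in the new next_cqe_sw(). The struct and function names below are illustrative, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two mthca_cq fields involved. */
struct cq_sketch {
	uint32_t cons_index;	/* free-running, stored unmasked */
	uint32_t cqe_mask;	/* number of entries - 1 (assumed power of two minus one) */
};

/* Slot of the next CQE to poll: the mask is applied only at lookup time. */
static uint32_t next_slot(const struct cq_sketch *cq)
{
	return cq->cons_index & cq->cqe_mask;
}

/* Consume n entries; unsigned wrap-around at 2^32 is well defined in C. */
static void consume(struct cq_sketch *cq, uint32_t n)
{
	cq->cons_index += n;
}
```

Keeping the stored index unmasked is what lets the Arbel path write cq->cons_index straight into the set_ci doorbell record, while the Tavor path still posts an increment through the doorbell register.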

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:00.312829664 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:01.214633912 -0800
@@ -136,11 +136,15 @@
 #define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7)
 #define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7)

-#define MTHCA_CQ_DB_INC_CI (1 << 24)
-#define MTHCA_CQ_DB_REQ_NOT (2 << 24)
-#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24)
-#define MTHCA_CQ_DB_SET_CI (4 << 24)
-#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24)
+#define MTHCA_TAVOR_CQ_DB_INC_CI (1 << 24)
+#define MTHCA_TAVOR_CQ_DB_REQ_NOT (2 << 24)
+#define MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL (3 << 24)
+#define MTHCA_TAVOR_CQ_DB_SET_CI (4 << 24)
+#define MTHCA_TAVOR_CQ_DB_REQ_NOT_MULT (5 << 24)
+
+#define MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL (1 << 24)
+#define MTHCA_ARBEL_CQ_DB_REQ_NOT (2 << 24)
+#define MTHCA_ARBEL_CQ_DB_REQ_NOT_MULT (3 << 24)

 static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry)
 {
@@ -159,7 +163,7 @@

 static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq)
 {
-	return cqe_sw(cq, cq->cons_index);
+	return cqe_sw(cq, cq->cons_index & cq->ibcq.cqe);
 }

 static inline void set_cqe_hw(struct mthca_cqe *cqe)
@@ -167,17 +171,26 @@
 	cqe->owner = MTHCA_CQ_ENTRY_OWNER_HW;
 }

-static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
-				  int nent)
+/*
+ * incr is ignored in native Arbel (mem-free) mode, so cq->cons_index
+ * should be correct before calling update_cons_index().
+ */
+static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
+				     int incr)
 {
 	u32 doorbell[2];

-	doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn);
-	doorbell[1] = cpu_to_be32(nent - 1);
+	if (dev->hca_type == ARBEL_NATIVE) {
+		*cq->set_ci_db = cpu_to_be32(cq->cons_index);
+		wmb();
+	} else {
+		doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
+		doorbell[1] = cpu_to_be32(incr - 1);

-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_CQ_DOORBELL,
-		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+		mthca_write64(doorbell,
+			      dev->kar + MTHCA_CQ_DOORBELL,
+			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+	}
 }

 void mthca_cq_event(struct mthca_dev *dev, u32 cqn)
@@ -191,6 +204,8 @@
 		return;
 	}

+	++cq->arm_sn;
+
 	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
 }

@@ -247,8 +262,8 @@

 	if (nfreed) {
 		wmb();
-		inc_cons_index(dev, cq, nfreed);
-		cq->cons_index = (cq->cons_index + nfreed) & cq->ibcq.cqe;
+		cq->cons_index += nfreed;
+		update_cons_index(dev, cq, nfreed);
 	}

 	spin_unlock_irq(&cq->lock);
@@ -341,7 +356,7 @@
 		break;
 	}

-	err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe);
+	err = mthca_free_err_wqe(dev, qp, is_send, wqe_index, &dbd, &new_wqe);
 	if (err)
 		return err;

@@ -411,7 +426,7 @@
 		if (*cur_qp) {
 			if (*freed) {
 				wmb();
-				inc_cons_index(dev, cq, *freed);
+				update_cons_index(dev, cq, *freed);
 				*freed = 0;
 			}
 			spin_unlock(&(*cur_qp)->lock);
@@ -505,7 +520,7 @@
 	if (likely(free_cqe)) {
 		set_cqe_hw(cqe);
 		++(*freed);
-		cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe;
+		++cq->cons_index;
 	}

 	return err;
@@ -533,7 +548,7 @@

 	if (freed) {
 		wmb();
-		inc_cons_index(dev, cq, freed);
+		update_cons_index(dev, cq, freed);
 	}

 	if (qp)
@@ -544,20 +559,57 @@
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }

-void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq,
-		  int solicited)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
 {
 	u32 doorbell[2];

-	doorbell[0] = cpu_to_be32((solicited ?
-				   MTHCA_CQ_DB_REQ_NOT_SOL :
-				   MTHCA_CQ_DB_REQ_NOT) |
-				  cq->cqn);
+	doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ?
+				   MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
+				   MTHCA_TAVOR_CQ_DB_REQ_NOT) |
+				  to_mcq(cq)->cqn);
 	doorbell[1] = 0xffffffff;

 	mthca_write64(doorbell,
-		      dev->kar + MTHCA_CQ_DOORBELL,
-		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+		      to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
+		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock));
+
+	return 0;
+}
+
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+{
+	struct mthca_cq *cq = to_mcq(ibcq);
+	u32 doorbell[2];
+	u32 sn;
+	u32 ci;
+
+	sn = cq->arm_sn & 3;
+	ci = cpu_to_be32(cq->cons_index);
+
+	doorbell[0] = ci;
+	doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
+				  (notify == IB_CQ_SOLICITED ? 1 : 2));
+
+	mthca_write_db_rec(doorbell, cq->arm_db);
+
+	/*
+	 * Make sure that the doorbell record in host memory is
+	 * written before ringing the doorbell via PCI MMIO.
+	 */
+	wmb();
+
+	doorbell[0] = cpu_to_be32((sn << 28) |
+				  (notify == IB_CQ_SOLICITED ?
+				   MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
+				   MTHCA_ARBEL_CQ_DB_REQ_NOT) |
+				  cq->cqn);
+	doorbell[1] = ci;
+
+	mthca_write64(doorbell,
+		      to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL,
+		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock));
+
+	return 0;
 }

 static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq)
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:59.077097900 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:01.213634129 -0800
@@ -368,8 +368,8 @@

 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq,
-		  int solicited);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_cq *cq);
 void mthca_free_cq(struct mthca_dev *dev,
@@ -384,7 +384,7 @@
 		    struct ib_send_wr **bad_wr);
 int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 		       struct ib_recv_wr **bad_wr);
-int mthca_free_err_wqe(struct mthca_qp *qp, int is_send,
+int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send,
 		       int index, int *dbd, u32 *new_wqe);
 int mthca_alloc_qp(struct mthca_dev *dev,
 		   struct mthca_pd *pd,
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:59.925913650 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:01.213634129 -0800
@@ -421,13 +421,6 @@
 	return 0;
 }

-static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify)
-{
-	mthca_arm_cq(to_mdev(cq->device), to_mcq(cq),
-		     notify == IB_CQ_SOLICITED);
-	return 0;
-}
-
 static inline u32 convert_access(int acc)
 {
 	return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) |
@@ -625,7 +618,6 @@
 	dev->ib_dev.create_cq = mthca_create_cq;
 	dev->ib_dev.destroy_cq = mthca_destroy_cq;
 	dev->ib_dev.poll_cq = mthca_poll_cq;
-	dev->ib_dev.req_notify_cq = mthca_req_notify_cq;
 	dev->ib_dev.get_dma_mr = mthca_get_dma_mr;
 	dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr;
 	dev->ib_dev.dereg_mr = mthca_dereg_mr;
@@ -633,6 +625,11 @@
 	dev->ib_dev.detach_mcast = mthca_multicast_detach;
 	dev->ib_dev.process_mad = mthca_process_mad;

+	if (dev->hca_type == ARBEL_NATIVE)
+		dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq;
+	else
+		dev->ib_dev.req_notify_cq = mthca_tavor_arm_cq;
+
 	init_MUTEX(&dev->cap_mask_mutex);

 	ret = ib_register_device(&dev->ib_dev);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:00.312829664 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.213634129 -0800
@@ -141,7 +141,7 @@
 	spinlock_t lock;
 	atomic_t refcount;
 	int cqn;
-	int cons_index;
+	u32 cons_index;
 	int is_direct;

 	/* Next fields are Arbel only */
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:56.155732030 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.215633695 -0800
@@ -1551,7 +1551,7 @@
 	return err;
 }

-int mthca_free_err_wqe(struct mthca_qp *qp, int is_send,
+int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send,
 		       int index, int *dbd, u32 *new_wqe)
 {
 	struct mthca_next_seg *next;
@@ -1561,7 +1561,10 @@
 	else
 		next = get_recv_wqe(qp, index);

-	*dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD));
+	if (dev->hca_type == ARBEL_NATIVE)
+		*dbd = 1;
+	else
+		*dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD));
 	if (next->ee_nds & cpu_to_be32(0x3f))
 		*new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) |
 			(next->ee_nds & cpu_to_be32(0x3f));

From roland at topspin.com Thu Mar 3 15:20:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:28 -0800
Subject: [openib-general] [PATCH][21/26] IB/mthca: mem-free address vectors
In-Reply-To: <2005331520.7k4CdyDk307HOUr6@topspin.com>
Message-ID: <2005331520.kqgduGt72iMbbNeg@topspin.com>

Update address vector handling to support mem-free mode. In mem-free
mode, the address vector (in hardware format) is copied by the driver
into each send work queue entry, so our address handle creation can
become pretty trivial: we just kmalloc() a buffer to hold the formatted
address vector.

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_av.c 2005-01-15 15:19:30.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_av.c 2005-03-03 14:13:02.121437076 -0800
@@ -60,27 +60,34 @@
 	u32 index = -1;
 	struct mthca_av *av = NULL;

-	ah->on_hca = 0;
+	ah->type = MTHCA_AH_PCI_POOL;

-	if (!atomic_read(&pd->sqp_count) &&
-	    !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) {
+	if (dev->hca_type == ARBEL_NATIVE) {
+		ah->av = kmalloc(sizeof *ah->av, GFP_KERNEL);
+		if (!ah->av)
+			return -ENOMEM;
+
+		ah->type = MTHCA_AH_KMALLOC;
+		av = ah->av;
+	} else if (!atomic_read(&pd->sqp_count) &&
+		   !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) {
 		index = mthca_alloc(&dev->av_table.alloc);

 		/* fall back to allocate in host memory */
 		if (index == -1)
-			goto host_alloc;
+			goto on_hca_fail;

 		av = kmalloc(sizeof *av, GFP_KERNEL);
 		if (!av)
-			goto host_alloc;
+			goto on_hca_fail;

-		ah->on_hca = 1;
+		ah->type = MTHCA_AH_ON_HCA;
 		ah->avdma = dev->av_table.ddr_av_base +
 			index * MTHCA_AV_SIZE;
 	}

- host_alloc:
-	if (!ah->on_hca) {
+on_hca_fail:
+	if (ah->type == MTHCA_AH_PCI_POOL) {
 		ah->av = pci_pool_alloc(dev->av_table.pool,
 					SLAB_KERNEL, &ah->avdma);
 		if (!ah->av)
@@ -123,7 +130,7 @@
 			  j * 4, be32_to_cpu(((u32 *) av)[j]));
 	}

-	if (ah->on_hca) {
+	if (ah->type == MTHCA_AH_ON_HCA) {
 		memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE,
 			    av, MTHCA_AV_SIZE);
 		kfree(av);
@@ -134,12 +141,21 @@

 int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
 {
-	if (ah->on_hca)
+	switch (ah->type) {
+	case MTHCA_AH_ON_HCA:
 		mthca_free(&dev->av_table.alloc,
 			   (ah->avdma - dev->av_table.ddr_av_base) /
 			   MTHCA_AV_SIZE);
-	else
+		break;
+
+	case MTHCA_AH_PCI_POOL:
 		pci_pool_free(dev->av_table.pool, ah->av, ah->avdma);
+		break;
+
+	case MTHCA_AH_KMALLOC:
+		kfree(ah->av);
+		break;
+	}

 	return 0;
 }
@@ -147,7 +163,7 @@
 int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah,
 		  struct ib_ud_header *header)
 {
-	if (ah->on_hca)
+	if (ah->type == MTHCA_AH_ON_HCA)
 		return -EINVAL;

 	header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28;
@@ -176,6 +192,9 @@
 {
 	int err;

+	if (dev->hca_type == ARBEL_NATIVE)
+		return 0;
+
 	err = mthca_alloc_init(&dev->av_table.alloc,
 			       dev->av_table.num_ddr_avs,
 			       dev->av_table.num_ddr_avs - 1,
@@ -212,6 +231,9 @@

 void __devexit mthca_cleanup_av_table(struct mthca_dev *dev)
 {
+	if (dev->hca_type == ARBEL_NATIVE)
+		return;
+
 	if (dev->av_table.av_map)
 		iounmap(dev->av_table.av_map);
 	pci_pool_destroy(dev->av_table.pool);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.712525837 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:02.120437293 -0800
@@ -82,12 +82,18 @@

 struct mthca_av;

+enum mthca_ah_type {
+	MTHCA_AH_ON_HCA,
+	MTHCA_AH_PCI_POOL,
+	MTHCA_AH_KMALLOC
+};
+
 struct mthca_ah {
-	struct ib_ah ibah;
-	int on_hca;
-	u32 key;
-	struct mthca_av *av;
-	dma_addr_t avdma;
+	struct ib_ah ibah;
+	enum mthca_ah_type type;
+	u32 key;
+	struct mthca_av *av;
+	dma_addr_t avdma;
 };

 /*

From roland at topspin.com Thu Mar 3 15:20:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:28 -0800
Subject: [openib-general] [PATCH][20/26] IB/mthca: mem-free QP initialization
In-Reply-To: <2005331520.VEavoMG964z0bUT1@topspin.com>
Message-ID: <2005331520.7k4CdyDk307HOUr6@topspin.com>

Update QP initialization and cleanup to handle mem-free mode.
In mem-free mode, work queue sizes have to be rounded up to a power of 2, we need to allocate doorbells, there must be memory mapped for the entries in the QP and extended QP context table that we use, and the entries of the receive queue must be initialized. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.213634129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.712525837 -0800 @@ -167,6 +167,9 @@ void *last; int max_gs; int wqe_shift; + + int db_index; /* Arbel only */ + u32 *db; }; struct mthca_qp { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.215633695 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.713525620 -0800 @@ -40,6 +40,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_memfree.h" enum { MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, @@ -105,8 +106,11 @@ struct mthca_qp_context { u32 flags; - u32 sched_queue; - u32 mtu_msgmax; + u32 tavor_sched_queue; /* Reserved on Arbel */ + u8 mtu_msgmax; + u8 rq_size_stride; /* Reserved on Tavor */ + u8 sq_size_stride; /* Reserved on Tavor */ + u8 rlkey_arbel_sched_queue; /* Reserved on Tavor */ u32 usr_page; u32 local_qpn; u32 remote_qpn; @@ -121,18 +125,22 @@ u32 reserved2; u32 next_send_psn; u32 cqn_snd; - u32 next_snd_wqe[2]; + u32 snd_wqe_base_l; /* Next send WQE on Tavor */ + u32 snd_db_index; /* (debugging only entries) */ u32 last_acked_psn; u32 ssn; u32 params2; u32 rnr_nextrecvpsn; u32 ra_buff_indx; u32 cqn_rcv; - u32 next_rcv_wqe[2]; + u32 rcv_wqe_base_l; /* Next recv WQE on Tavor */ + u32 rcv_db_index; /* (debugging only entries) */ u32 qkey; u32 srqn; u32 rmsn; - u32 reserved3[19]; + u16 rq_wqe_counter; /* reserved on Tavor */ + u16 sq_wqe_counter; /* reserved on Tavor */ + u32 reserved3[18]; } __attribute__((packed)); struct mthca_qp_param { @@ -193,7 +201,7 @@ u32 imm; /* immediate data */ }; -struct mthca_ud_seg { 
+struct mthca_tavor_ud_seg { u32 reserved1; u32 lkey; u64 av_addr; @@ -203,6 +211,13 @@ u32 reserved3[2]; }; +struct mthca_arbel_ud_seg { + u32 av[8]; + u32 dqpn; + u32 qkey; + u32 reserved[2]; +}; + struct mthca_bind_seg { u32 flags; /* [31] Atomic [30] rem write [29] rem read */ u32 reserved; @@ -617,14 +632,24 @@ break; } } - /* leave sched_queue as 0 */ + + /* leave tavor_sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) - qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | - (11 << 24)); + qp_context->mtu_msgmax = (IB_MTU_2048 << 5) | 11; else if (attr_mask & IB_QP_PATH_MTU) { - qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | - (31 << 24)); + qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; + } + + if (dev->hca_type == ARBEL_NATIVE) { + qp_context->rq_size_stride = + ((ffs(qp->rq.max) - 1) << 3) | (qp->rq.wqe_shift - 4); + qp_context->sq_size_stride = + ((ffs(qp->sq.max) - 1) << 3) | (qp->sq.wqe_shift - 4); } + + /* leave arbel_sched_queue as 0 */ + qp_context->usr_page = cpu_to_be32(dev->driver_uar.index); qp_context->local_qpn = cpu_to_be32(qp->qpn); if (attr_mask & IB_QP_DEST_QPN) { @@ -708,6 +733,11 @@ qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + if (dev->hca_type == ARBEL_NATIVE) { + qp_context->snd_wqe_base_l = cpu_to_be32(qp->send_wqe_offset); + qp_context->snd_db_index = cpu_to_be32(qp->sq.db_index); + } + if (attr_mask & IB_QP_ACCESS_FLAGS) { /* * Only enable RDMA/atomics if we have responder @@ -787,12 +817,16 @@ if (attr_mask & IB_QP_RQ_PSN) qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); - qp_context->ra_buff_indx = dev->qp_table.rdb_base + - ((qp->qpn & (dev->limits.num_qps - 1)) * MTHCA_RDB_ENTRY_SIZE << - dev->qp_table.rdb_shift); + qp_context->ra_buff_indx = + cpu_to_be32(dev->qp_table.rdb_base + + ((qp->qpn & (dev->limits.num_qps - 1)) * MTHCA_RDB_ENTRY_SIZE << + dev->qp_table.rdb_shift)); qp_context->cqn_rcv = 
cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + if (dev->hca_type == ARBEL_NATIVE) + qp_context->rcv_db_index = cpu_to_be32(qp->rq.db_index); + if (attr_mask & IB_QP_QKEY) { qp_context->qkey = cpu_to_be32(attr->qkey); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); @@ -860,12 +894,20 @@ size = sizeof (struct mthca_next_seg) + qp->sq.max_gs * sizeof (struct mthca_data_seg); - if (qp->transport == MLX) + switch (qp->transport) { + case MLX: size += 2 * sizeof (struct mthca_data_seg); - else if (qp->transport == UD) - size += sizeof (struct mthca_ud_seg); - else /* bind seg is as big as atomic + raddr segs */ + break; + case UD: + if (dev->hca_type == ARBEL_NATIVE) + size += sizeof (struct mthca_arbel_ud_seg); + else + size += sizeof (struct mthca_tavor_ud_seg); + break; + default: + /* bind seg is as big as atomic + raddr segs */ size += sizeof (struct mthca_bind_seg); + } for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; qp->sq.wqe_shift++) @@ -942,7 +984,6 @@ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, npages, 0, size, - MTHCA_MPT_FLAG_LOCAL_WRITE | MTHCA_MPT_FLAG_LOCAL_READ, &qp->mr); if (err) @@ -972,6 +1013,60 @@ return err; } +static int mthca_alloc_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + int ret = 0; + + if (dev->hca_type == ARBEL_NATIVE) { + ret = mthca_table_get(dev, dev->qp_table.qp_table, qp->qpn); + if (ret) + return ret; + + ret = mthca_table_get(dev, dev->qp_table.eqp_table, qp->qpn); + if (ret) + goto err_qpc; + + qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ, + qp->qpn, &qp->rq.db); + if (qp->rq.db_index < 0) { + ret = -ENOMEM; + goto err_eqpc; + } + + qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ, + qp->qpn, &qp->sq.db); + if (qp->sq.db_index < 0) { + ret = -ENOMEM; + goto err_rq_db; + } + } + + return 0; + +err_rq_db: + mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); + +err_eqpc: + mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); + +err_qpc: + mthca_table_put(dev, 
dev->qp_table.qp_table, qp->qpn); + + return ret; +} + +static void mthca_free_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + if (dev->hca_type == ARBEL_NATIVE) { + mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index); + mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); + mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); + mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn); + } +} + static int mthca_alloc_qp_common(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, @@ -979,7 +1074,9 @@ enum ib_sig_type send_policy, struct mthca_qp *qp) { - int err; + struct mthca_next_seg *wqe; + int ret; + int i; spin_lock_init(&qp->lock); atomic_set(&qp->refcount, 1); @@ -996,8 +1093,51 @@ qp->rq.last = NULL; qp->sq.last = NULL; - err = mthca_alloc_wqe_buf(dev, pd, qp); - return err; + ret = mthca_alloc_memfree(dev, qp); + if (ret) + return ret; + + ret = mthca_alloc_wqe_buf(dev, pd, qp); + if (ret) { + mthca_free_memfree(dev, qp); + return ret; + } + + if (dev->hca_type == ARBEL_NATIVE) { + for (i = 0; i < qp->rq.max; ++i) { + wqe = get_recv_wqe(qp, i); + wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) << + qp->rq.wqe_shift); + wqe->ee_nds = cpu_to_be32(1 << (qp->rq.wqe_shift - 4)); + } + + for (i = 0; i < qp->sq.max; ++i) { + wqe = get_send_wqe(qp, i); + wqe->nda_op = cpu_to_be32((((i + 1) & (qp->sq.max - 1)) << + qp->sq.wqe_shift) + + qp->send_wqe_offset); + } + } + + return 0; +} + +static void mthca_align_qp_size(struct mthca_dev *dev, struct mthca_qp *qp) +{ + int i; + + if (dev->hca_type != ARBEL_NATIVE) + return; + + for (i = 0; 1 << i < qp->rq.max; ++i) + ; /* nothing */ + + qp->rq.max = 1 << i; + + for (i = 0; 1 << i < qp->sq.max; ++i) + ; /* nothing */ + + qp->sq.max = 1 << i; } int mthca_alloc_qp(struct mthca_dev *dev, @@ -1010,6 +1150,8 @@ { int err; + mthca_align_qp_size(dev, qp); + switch (type) { case IB_QPT_RC: qp->transport = RC; break; case IB_QPT_UC: qp->transport = UC; break; @@ -1048,6 +1190,8 @@ int err = 
0; u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + mthca_align_qp_size(dev, &sqp->qp); + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, &sqp->header_dma, GFP_KERNEL); @@ -1160,14 +1304,15 @@ kfree(qp->wrid); + mthca_free_memfree(dev, qp); + if (is_sqp(dev, qp)) { atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); dma_free_coherent(&dev->pdev->dev, to_msqp(qp)->header_buf_size, to_msqp(qp)->header_buf, to_msqp(qp)->header_dma); - } - else + } else mthca_free(&dev->qp_table.alloc, qp->qpn); } @@ -1350,17 +1495,17 @@ break; case UD: - ((struct mthca_ud_seg *) wqe)->lkey = + ((struct mthca_tavor_ud_seg *) wqe)->lkey = cpu_to_be32(to_mah(wr->wr.ud.ah)->key); - ((struct mthca_ud_seg *) wqe)->av_addr = + ((struct mthca_tavor_ud_seg *) wqe)->av_addr = cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); - ((struct mthca_ud_seg *) wqe)->dqpn = + ((struct mthca_tavor_ud_seg *) wqe)->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn); - ((struct mthca_ud_seg *) wqe)->qkey = + ((struct mthca_tavor_ud_seg *) wqe)->qkey = cpu_to_be32(wr->wr.ud.remote_qkey); - wqe += sizeof (struct mthca_ud_seg); - size += sizeof (struct mthca_ud_seg) / 16; + wqe += sizeof (struct mthca_tavor_ud_seg); + size += sizeof (struct mthca_tavor_ud_seg) / 16; break; case MLX: From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][22/26] IB/mthca: mem-free work request posting In-Reply-To: <2005331520.kqgduGt72iMbbNeg@topspin.com> Message-ID: <2005331520.ADYAIRdSQBiHhYiD@topspin.com> Implement posting send and receive work requests for mem-free mode. Also tidy up a few things in send/receive posting for Tavor mode (fix smp_wmb()s that should really be just wmb()s, annotate tests in the fast path with likely()/unlikely()). 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:01.213634129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:02.565340719 -0800 @@ -380,10 +380,14 @@ void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); -int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, - struct ib_send_wr **bad_wr); -int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, - struct ib_recv_wr **bad_wr); +int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_arbel_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, int index, int *dbd, u32 *new_wqe); int mthca_alloc_qp(struct mthca_dev *dev, --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:01.213634129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:02.566340502 -0800 @@ -613,8 +613,6 @@ dev->ib_dev.create_qp = mthca_create_qp; dev->ib_dev.modify_qp = mthca_modify_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; - dev->ib_dev.post_send = mthca_post_send; - dev->ib_dev.post_recv = mthca_post_receive; dev->ib_dev.create_cq = mthca_create_cq; dev->ib_dev.destroy_cq = mthca_destroy_cq; dev->ib_dev.poll_cq = mthca_poll_cq; @@ -625,10 +623,15 @@ dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; - if (dev->hca_type == ARBEL_NATIVE) + if (dev->hca_type == ARBEL_NATIVE) { dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; - else + dev->ib_dev.post_send = 
mthca_arbel_post_send; + dev->ib_dev.post_recv = mthca_arbel_post_receive; + } else { dev->ib_dev.req_notify_cq = mthca_tavor_arm_cq; + dev->ib_dev.post_send = mthca_tavor_post_send; + dev->ib_dev.post_recv = mthca_tavor_post_receive; + } init_MUTEX(&dev->cap_mask_mutex); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.713525620 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:02.567340285 -0800 @@ -253,6 +253,16 @@ u16 vcrc; }; +static const u8 mthca_opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, +}; + static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) { return qp->qpn >= dev->qp_table.sqp_start && @@ -637,9 +647,8 @@ if (qp->transport == MLX || qp->transport == UD) qp_context->mtu_msgmax = (IB_MTU_2048 << 5) | 11; - else if (attr_mask & IB_QP_PATH_MTU) { + else if (attr_mask & IB_QP_PATH_MTU) qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; - } if (dev->hca_type == ARBEL_NATIVE) { qp_context->rq_size_stride = @@ -1385,8 +1394,8 @@ return 0; } -int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, - struct ib_send_wr **bad_wr) +int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); @@ -1402,16 +1411,6 @@ int ind; u8 op0 = 0; - static const u8 opcode[] = { - [IB_WR_SEND] = MTHCA_OPCODE_SEND, - [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, - [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, - [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, - [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, - [IB_WR_ATOMIC_CMP_AND_SWP] = 
MTHCA_OPCODE_ATOMIC_CS, - [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, - }; - spin_lock_irqsave(&qp->lock, flags); /* XXX check that state is OK to post send */ @@ -1550,7 +1549,7 @@ qp->wrid[ind + qp->rq.max] = wr->wr_id; - if (wr->opcode >= ARRAY_SIZE(opcode)) { + if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) { mthca_err(dev, "opcode invalid\n"); err = -EINVAL; *bad_wr = wr; @@ -1561,15 +1560,15 @@ ((struct mthca_next_seg *) prev_wqe)->nda_op = cpu_to_be32(((ind << qp->sq.wqe_shift) + qp->send_wqe_offset) | - opcode[wr->opcode]); - smp_wmb(); + mthca_opcode[wr->opcode]); + wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = cpu_to_be32((size0 ? 0 : MTHCA_NEXT_DBD) | size); } if (!size0) { size0 = size; - op0 = opcode[wr->opcode]; + op0 = mthca_opcode[wr->opcode]; } ++ind; @@ -1578,7 +1577,7 @@ } out: - if (nreq) { + if (likely(nreq)) { u32 doorbell[2]; doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + @@ -1599,8 +1598,8 @@ return err; } -int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, - struct ib_recv_wr **bad_wr) +int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); @@ -1621,7 +1620,7 @@ ind = qp->rq.next; for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (qp->rq.cur + nreq >= qp->rq.max) { + if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { mthca_err(dev, "RQ %06x full\n", qp->qpn); err = -ENOMEM; *bad_wr = wr; @@ -1640,7 +1639,7 @@ wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; - if (wr->num_sge > qp->rq.max_gs) { + if (unlikely(wr->num_sge > qp->rq.max_gs)) { err = -EINVAL; *bad_wr = wr; goto out; @@ -1659,10 +1658,10 @@ qp->wrid[ind] = wr->wr_id; - if (prev_wqe) { + if (likely(prev_wqe)) { ((struct mthca_next_seg *) prev_wqe)->nda_op = cpu_to_be32((ind << qp->rq.wqe_shift) | 1); - smp_wmb(); + wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = 
cpu_to_be32(MTHCA_NEXT_DBD | size); } @@ -1676,7 +1675,7 @@ } out: - if (nreq) { + if (likely(nreq)) { u32 doorbell[2]; doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); @@ -1696,6 +1695,247 @@ return err; } +int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next & (qp->sq.max - 1); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+			 cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) |
+			cpu_to_be32(1);
+		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
+		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+			((struct mthca_next_seg *) wqe)->imm = wr->imm_data;
+
+		wqe += sizeof (struct mthca_next_seg);
+		size = sizeof (struct mthca_next_seg) / 16;
+
+		switch (qp->transport) {
+		case UD:
+			memcpy(((struct mthca_arbel_ud_seg *) wqe)->av,
+			       to_mah(wr->wr.ud.ah)->av, MTHCA_AV_SIZE);
+			((struct mthca_arbel_ud_seg *) wqe)->dqpn =
+				cpu_to_be32(wr->wr.ud.remote_qpn);
+			((struct mthca_arbel_ud_seg *) wqe)->qkey =
+				cpu_to_be32(wr->wr.ud.remote_qkey);
+
+			wqe += sizeof (struct mthca_arbel_ud_seg);
+			size += sizeof (struct mthca_arbel_ud_seg) / 16;
+			break;
+
+		case MLX:
+			err = build_mlx_header(dev, to_msqp(qp), ind, wr,
+					       wqe - sizeof (struct mthca_next_seg),
+					       wqe);
+			if (err) {
+				*bad_wr = wr;
+				goto out;
+			}
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+			break;
+		}
+
+		if (wr->num_sge > qp->sq.max_gs) {
+			mthca_err(dev, "too many gathers\n");
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		for (i = 0; i < wr->num_sge; ++i) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32(wr->sg_list[i].length);
+			((struct mthca_data_seg *) wqe)->lkey =
+				cpu_to_be32(wr->sg_list[i].lkey);
+			((struct mthca_data_seg *) wqe)->addr =
+				cpu_to_be64(wr->sg_list[i].addr);
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		/* Add one more inline data segment for ICRC */
+		if (qp->transport == MLX) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32((1 << 31) | 4);
+			((u32 *) wqe)[1] = 0;
+			wqe += sizeof (struct mthca_data_seg);
+			size += sizeof (struct mthca_data_seg) / 16;
+		}
+
+		qp->wrid[ind + qp->rq.max] = wr->wr_id;
+
+		if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) {
+			mthca_err(dev, "opcode invalid\n");
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		if (likely(prev_wqe)) {
+			((struct mthca_next_seg *) prev_wqe)->nda_op =
+				cpu_to_be32(((ind << qp->sq.wqe_shift) +
+					     qp->send_wqe_offset) |
+					    mthca_opcode[wr->opcode]);
+			wmb();
+			((struct mthca_next_seg *) prev_wqe)->ee_nds =
+				cpu_to_be32(MTHCA_NEXT_DBD | size);
+		}
+
+		if (!size0) {
+			size0 = size;
+			op0 = mthca_opcode[wr->opcode];
+		}
+
+		++ind;
+		if (unlikely(ind >= qp->sq.max))
+			ind -= qp->sq.max;
+	}
+
+out:
+	if (likely(nreq)) {
+		u32 doorbell[2];
+
+		doorbell[0] = cpu_to_be32((nreq << 24) |
+					  ((qp->sq.next & 0xffff) << 8) |
+					  f0 | op0);
+		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+
+		qp->sq.cur += nreq;
+		qp->sq.next += nreq;
+
+		/*
+		 * Make sure that descriptors are written before
+		 * doorbell record.
+		 */
+		wmb();
+		*qp->sq.db = cpu_to_be32(qp->sq.next & 0xffff);
+
+		/*
+		 * Make sure doorbell record is written before we
+		 * write MMIO send doorbell.
+		 */
+		wmb();
+		mthca_write64(doorbell,
+			      dev->kar + MTHCA_SEND_DOORBELL,
+			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+	}
+
+	spin_unlock_irqrestore(&qp->lock, flags);
+	return err;
+}
+
+int mthca_arbel_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+			     struct ib_recv_wr **bad_wr)
+{
+	struct mthca_dev *dev = to_mdev(ibqp->device);
+	struct mthca_qp *qp = to_mqp(ibqp);
+	unsigned long flags;
+	int err = 0;
+	int nreq;
+	int ind;
+	int i;
+	void *wqe;
+
+	spin_lock_irqsave(&qp->lock, flags);
+
+	/* XXX check that state is OK to post receive */
+
+	ind = qp->rq.next & (qp->rq.max - 1);
+
+	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) {
+			mthca_err(dev, "RQ %06x full\n", qp->qpn);
+			err = -ENOMEM;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		wqe = get_recv_wqe(qp, ind);
+
+		((struct mthca_next_seg *) wqe)->flags = 0;
+
+		wqe += sizeof (struct mthca_next_seg);
+
+		if (unlikely(wr->num_sge > qp->rq.max_gs)) {
+			err = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		for (i = 0; i < wr->num_sge; ++i) {
+			((struct mthca_data_seg *) wqe)->byte_count =
+				cpu_to_be32(wr->sg_list[i].length);
+			((struct mthca_data_seg *) wqe)->lkey =
+				cpu_to_be32(wr->sg_list[i].lkey);
+			((struct mthca_data_seg *) wqe)->addr =
+				cpu_to_be64(wr->sg_list[i].addr);
+			wqe += sizeof (struct mthca_data_seg);
+		}
+
+		if (i < qp->rq.max_gs) {
+			((struct mthca_data_seg *) wqe)->byte_count = 0;
+			((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(0x100);
+			((struct mthca_data_seg *) wqe)->addr = 0;
+		}
+
+		qp->wrid[ind] = wr->wr_id;
+
+		++ind;
+		if (unlikely(ind >= qp->rq.max))
+			ind -= qp->rq.max;
+	}
+out:
+	if (likely(nreq)) {
+		qp->rq.cur += nreq;
+		qp->rq.next += nreq;
+
+		/*
+		 * Make sure that descriptors are written before
+		 * doorbell record.
+		 */
+		wmb();
+		*qp->rq.db = cpu_to_be32(qp->rq.next & 0xffff);
+	}
+
+	spin_unlock_irqrestore(&qp->lock, flags);
+	return err;
+}
+
 int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send,
 		       int index, int *dbd, u32 *new_wqe)
 {

From roland at topspin.com Thu Mar 3 15:20:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 3 Mar 2005 15:20:28 -0800
Subject: [openib-general] [PATCH][24/26] IB/mthca: QP locking optimization
In-Reply-To: <2005331520.kVkmRDQ3e4IStEy9@topspin.com>
Message-ID: <2005331520.i9PPmMDNBr0DxH5I@topspin.com>

From: Michael S. Tsirkin

1. Split the QP spinlock into separate send and receive locks.
   The only place where we have to lock both is upon modify_qp, and
   that is not on the data path.

2. Avoid taking any QP locks when polling CQ.

This last part is achieved by getting rid of the cur field in
mthca_wq, and calculating the number of outstanding WQEs by comparing
the head and tail fields. head is only updated by post, tail is only
updated by poll. In the rare case where an overrun is detected, the CQ
is locked and the overrun condition is re-tested, to avoid any
potential for stale tail values.

Signed-off-by: Michael S.
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:01.214633912 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:03.417155819 -0800 @@ -423,15 +423,6 @@ is_send = is_error ? cqe->opcode & 0x01 : cqe->is_send & 0x80; if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { - if (*cur_qp) { - if (*freed) { - wmb(); - update_cons_index(dev, cq, *freed); - *freed = 0; - } - spin_unlock(&(*cur_qp)->lock); - } - /* * We do not have to take the QP table lock here, * because CQs will be locked while QPs are removed @@ -446,8 +437,6 @@ err = -EINVAL; goto out; } - - spin_lock(&(*cur_qp)->lock); } entry->qp_num = (*cur_qp)->qpn; @@ -465,9 +454,9 @@ } if (wq->last_comp < wqe_index) - wq->cur -= wqe_index - wq->last_comp; + wq->tail += wqe_index - wq->last_comp; else - wq->cur -= wq->max - wq->last_comp + wqe_index; + wq->tail += wqe_index + wq->max - wq->last_comp; wq->last_comp = wqe_index; @@ -551,9 +540,6 @@ update_cons_index(dev, cq, freed); } - if (qp) - spin_unlock(&qp->lock); - spin_unlock_irqrestore(&cq->lock, flags); return err == 0 || err == -EAGAIN ? 
npolled : err; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:02.120437293 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:03.416156036 -0800 @@ -166,21 +166,22 @@ }; struct mthca_wq { - int max; - int cur; - int next; - int last_comp; - void *last; - int max_gs; - int wqe_shift; + spinlock_t lock; + int max; + unsigned next_ind; + unsigned last_comp; + unsigned head; + unsigned tail; + void *last; + int max_gs; + int wqe_shift; - int db_index; /* Arbel only */ - u32 *db; + int db_index; /* Arbel only */ + u32 *db; }; struct mthca_qp { struct ib_qp ibqp; - spinlock_t lock; atomic_t refcount; u32 qpn; int is_direct; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:02.567340285 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:03.418155602 -0800 @@ -577,9 +577,11 @@ else cur_state = attr->cur_qp_state; } else { - spin_lock_irq(&qp->lock); + spin_lock_irq(&qp->sq.lock); + spin_lock(&qp->rq.lock); cur_state = qp->state; - spin_unlock_irq(&qp->lock); + spin_unlock(&qp->rq.lock); + spin_unlock_irq(&qp->sq.lock); } if (attr_mask & IB_QP_STATE) { @@ -1076,6 +1078,16 @@ } } +static void mthca_wq_init(struct mthca_wq* wq) +{ + spin_lock_init(&wq->lock); + wq->next_ind = 0; + wq->last_comp = wq->max - 1; + wq->head = 0; + wq->tail = 0; + wq->last = NULL; +} + static int mthca_alloc_qp_common(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, @@ -1087,20 +1099,13 @@ int ret; int i; - spin_lock_init(&qp->lock); atomic_set(&qp->refcount, 1); qp->state = IB_QPS_RESET; qp->atomic_rd_en = 0; qp->resp_depth = 0; qp->sq_policy = send_policy; - qp->rq.cur = 0; - qp->sq.cur = 0; - qp->rq.next = 0; - qp->sq.next = 0; - qp->rq.last_comp = qp->rq.max - 1; - qp->sq.last_comp = qp->sq.max - 1; - qp->rq.last = NULL; - qp->sq.last = NULL; + mthca_wq_init(&qp->sq); + mthca_wq_init(&qp->rq); ret = mthca_alloc_memfree(dev, qp); if (ret) @@ -1394,6 
+1399,24 @@ return 0; } +static inline int mthca_wq_overflow(struct mthca_wq *wq, int nreq, + struct ib_cq *ib_cq) +{ + unsigned cur; + struct mthca_cq *cq; + + cur = wq->head - wq->tail; + if (likely(cur + nreq < wq->max)) + return 0; + + cq = to_mcq(ib_cq); + spin_lock(&cq->lock); + cur = wq->head - wq->tail; + spin_unlock(&cq->lock); + + return cur + nreq >= wq->max; +} + int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1411,16 +1434,18 @@ int ind; u8 op0 = 0; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->sq.lock, flags); /* XXX check that state is OK to post send */ - ind = qp->sq.next; + ind = qp->sq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (qp->sq.cur + nreq >= qp->sq.max) { - mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", - qp->sq.cur, qp->sq.max, nreq); + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { + mthca_err(dev, "SQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->sq.head, qp->sq.tail, + qp->sq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1580,7 +1605,7 @@ if (likely(nreq)) { u32 doorbell[2]; - doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) + qp->send_wqe_offset) | f0 | op0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); @@ -1591,10 +1616,10 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } - qp->sq.cur += nreq; - qp->sq.next = ind; + qp->sq.next_ind = ind; + qp->sq.head += nreq; - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->sq.lock, flags); return err; } @@ -1613,15 +1638,18 @@ void *wqe; void *prev_wqe; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->rq.lock, flags); /* XXX check that state is OK to post receive */ - ind = qp->rq.next; + ind = qp->rq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { - mthca_err(dev, "RQ %06x 
full\n", qp->qpn); + if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { + mthca_err(dev, "RQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->rq.head, qp->rq.tail, + qp->rq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1678,7 +1706,7 @@ if (likely(nreq)) { u32 doorbell[2]; - doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); wmb(); @@ -1688,10 +1716,10 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } - qp->rq.cur += nreq; - qp->rq.next = ind; + qp->rq.next_ind = ind; + qp->rq.head += nreq; - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->rq.lock, flags); return err; } @@ -1712,16 +1740,18 @@ int ind; u8 op0 = 0; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->sq.lock, flags); /* XXX check that state is OK to post send */ - ind = qp->sq.next & (qp->sq.max - 1); + ind = qp->sq.head & (qp->sq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (qp->sq.cur + nreq >= qp->sq.max) { - mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", - qp->sq.cur, qp->sq.max, nreq); + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { + mthca_err(dev, "SQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->sq.head, qp->sq.tail, + qp->sq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1831,19 +1861,18 @@ u32 doorbell[2]; doorbell[0] = cpu_to_be32((nreq << 24) | - ((qp->sq.next & 0xffff) << 8) | + ((qp->sq.head & 0xffff) << 8) | f0 | op0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); - qp->sq.cur += nreq; - qp->sq.next += nreq; + qp->sq.head += nreq; /* * Make sure that descriptors are written before * doorbell record. 
*/ wmb(); - *qp->sq.db = cpu_to_be32(qp->sq.next & 0xffff); + *qp->sq.db = cpu_to_be32(qp->sq.head & 0xffff); /* * Make sure doorbell record is written before we @@ -1855,7 +1884,7 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->sq.lock, flags); return err; } @@ -1871,15 +1900,18 @@ int i; void *wqe; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->rq.lock, flags); /* XXX check that state is OK to post receive */ - ind = qp->rq.next & (qp->rq.max - 1); + ind = qp->rq.head & (qp->rq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { - mthca_err(dev, "RQ %06x full\n", qp->qpn); + if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { + mthca_err(dev, "RQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->rq.head, qp->rq.tail, + qp->rq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1921,18 +1953,17 @@ } out: if (likely(nreq)) { - qp->rq.cur += nreq; - qp->rq.next += nreq; + qp->rq.head += nreq; /* * Make sure that descriptors are written before * doorbell record. */ wmb(); - *qp->rq.db = cpu_to_be32(qp->rq.next & 0xffff); + *qp->rq.db = cpu_to_be32(qp->rq.head & 0xffff); } - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->rq.lock, flags); return err; } From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][25/26] IB/mthca: implement query of device caps In-Reply-To: <2005331520.i9PPmMDNBr0DxH5I@topspin.com> Message-ID: <2005331520.mctunM7QrSZHM8mX@topspin.com> From: Michael S. Tsirkin Set device_cap_flags field in mthca's query_device method. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-01-25 20:48:02.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-03-03 14:13:03.934043620 -0800 @@ -95,7 +95,21 @@ }; enum { - DEV_LIM_FLAG_SRQ = 1 << 6 + DEV_LIM_FLAG_RC = 1 << 0, + DEV_LIM_FLAG_UC = 1 << 1, + DEV_LIM_FLAG_UD = 1 << 2, + DEV_LIM_FLAG_RD = 1 << 3, + DEV_LIM_FLAG_RAW_IPV6 = 1 << 4, + DEV_LIM_FLAG_RAW_ETHER = 1 << 5, + DEV_LIM_FLAG_SRQ = 1 << 6, + DEV_LIM_FLAG_BAD_PKEY_CNTR = 1 << 8, + DEV_LIM_FLAG_BAD_QKEY_CNTR = 1 << 9, + DEV_LIM_FLAG_MW = 1 << 16, + DEV_LIM_FLAG_AUTO_PATH_MIG = 1 << 17, + DEV_LIM_FLAG_ATOMIC = 1 << 18, + DEV_LIM_FLAG_RAW_MULTI = 1 << 19, + DEV_LIM_FLAG_UD_AV_PORT_ENFORCE = 1 << 20, + DEV_LIM_FLAG_UD_MULTI = 1 << 21, }; struct mthca_dev_lim { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:03.005245231 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:03.932044054 -0800 @@ -218,6 +218,7 @@ int hca_type; unsigned long mthca_flags; + unsigned long device_cap_flags; u32 rev_id; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:13:03.005245231 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:13:03.933043837 -0800 @@ -171,6 +171,33 @@ mdev->limits.reserved_uars = dev_lim->reserved_uars; mdev->limits.reserved_pds = dev_lim->reserved_pds; + /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. + May be doable since hardware supports it for SRQ. + + IB_DEVICE_N_NOTIFY_CQ is supported by hardware but not by driver. + + IB_DEVICE_SRQ_RESIZE is supported by hardware but SRQ is not + supported by driver. 
*/ + mdev->device_cap_flags = IB_DEVICE_CHANGE_PHY_PORT | + IB_DEVICE_PORT_ACTIVE_EVENT | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_RC_RNR_NAK_GEN; + + if (dev_lim->flags & DEV_LIM_FLAG_BAD_PKEY_CNTR) + mdev->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR; + + if (dev_lim->flags & DEV_LIM_FLAG_BAD_QKEY_CNTR) + mdev->device_cap_flags |= IB_DEVICE_BAD_QKEY_CNTR; + + if (dev_lim->flags & DEV_LIM_FLAG_RAW_MULTI) + mdev->device_cap_flags |= IB_DEVICE_RAW_MULTI; + + if (dev_lim->flags & DEV_LIM_FLAG_AUTO_PATH_MIG) + mdev->device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; + + if (dev_lim->flags & DEV_LIM_FLAG_UD_AV_PORT_ENFORCE) + mdev->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) mdev->mthca_flags |= MTHCA_FLAG_SRQ; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:02.566340502 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:03.933043837 -0800 @@ -43,6 +43,8 @@ struct ib_smp *in_mad = NULL; struct ib_smp *out_mad = NULL; int err = -ENOMEM; + struct mthca_dev* mdev = to_mdev(ibdev); + u8 status; in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); @@ -50,7 +52,7 @@ if (!in_mad || !out_mad) goto out; - props->fw_ver = to_mdev(ibdev)->fw_ver; + props->fw_ver = mdev->fw_ver; memset(in_mad, 0, sizeof *in_mad); in_mad->base_version = 1; @@ -59,7 +61,7 @@ in_mad->method = IB_MGMT_METHOD_GET; in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; - err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, + err = mthca_MAD_IFC(mdev, 1, 1, 1, NULL, NULL, in_mad, out_mad, &status); if (err) @@ -69,10 +71,11 @@ goto out; } - props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & + props->device_cap_flags = mdev->device_cap_flags; + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & 0xffffff; - props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); - props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); 
+ props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); memcpy(&props->sys_image_guid, out_mad->data + 4, 8); memcpy(&props->node_guid, out_mad->data + 12, 8); From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][23/26] IB/mthca: mem-free multicast table In-Reply-To: <2005331520.ADYAIRdSQBiHhYiD@topspin.com> Message-ID: <2005331520.kVkmRDQ3e4IStEy9@topspin.com> Tie up one last loose end by mapping enough context memory to cover the whole multicast table during initialization, and then enable mem-free mode. mthca now supports enough of mem-free mode so that IPoIB works with a mem-free HCA. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:02.565340719 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:03.005245231 -0800 @@ -207,8 +207,9 @@ }; struct mthca_mcg_table { - struct semaphore sem; - struct mthca_alloc alloc; + struct semaphore sem; + struct mthca_alloc alloc; + struct mthca_icm_table *table; }; struct mthca_dev { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:57.858362446 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:13:03.005245231 -0800 @@ -412,8 +412,29 @@ goto err_unmap_eqp; } + /* + * It's not strictly required, but for simplicity just map the + * whole multicast group table now. The table isn't very big + * and it's a lot easier than trying to track ref counts. 
+ */ + mdev->mcg_table.table = mthca_alloc_icm_table(mdev, init_hca->mc_base, + MTHCA_MGM_ENTRY_SIZE, + mdev->limits.num_mgms + + mdev->limits.num_amgms, + mdev->limits.num_mgms + + mdev->limits.num_amgms, + 0); + if (!mdev->mcg_table.table) { + mthca_err(mdev, "Failed to map MCG context memory, aborting.\n"); + err = -ENOMEM; + goto err_unmap_cq; + } + return 0; +err_unmap_cq: + mthca_free_icm_table(mdev, mdev->cq_table.table); + err_unmap_eqp: mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); @@ -587,7 +608,7 @@ goto err_uar_free; } - err = mthca_init_pd_table(dev); + err = mthca_init_pd_table(dev); if (err) { mthca_err(dev, "Failed to initialize " "protection domain table, aborting.\n"); @@ -635,13 +656,6 @@ mthca_dbg(dev, "NOP command IRQ test passed\n"); - if (dev->hca_type == ARBEL_NATIVE) { - mthca_warn(dev, "Sorry, native MT25208 mode support is not complete, " - "aborting.\n"); - err = -ENODEV; - goto err_cmd_poll; - } - err = mthca_init_cq_table(dev); if (err) { mthca_err(dev, "Failed to initialize " @@ -704,7 +718,7 @@ err_uar_table_free: mthca_cleanup_uar_table(dev); - return err; + return err; } static int __devinit mthca_request_regions(struct pci_dev *pdev, @@ -814,6 +828,7 @@ const struct pci_device_id *id) { static int mthca_version_printed = 0; + static int mthca_memfree_warned = 0; int ddr_hidden = 0; int err; struct mthca_dev *mdev; @@ -893,6 +908,10 @@ mdev->pdev = pdev; mdev->hca_type = id->driver_data; + if (mdev->hca_type == ARBEL_NATIVE && !mthca_memfree_warned++) + mthca_warn(mdev, "Warning: native MT25208 mode support is incomplete. 
" + "Your HCA may not work properly.\n"); + if (ddr_hidden) mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][26/26] IB: MAD cancel callbacks from thread In-Reply-To: <2005331520.mctunM7QrSZHM8mX@topspin.com> Message-ID: <2005331520.zA1xypugai2bUq7X@topspin.com> From: Sean Hefty Modify ib_cancel_mad() to invoke a user's send completion callback from a different thread context than that used by the caller. This allows a caller to hold a lock while calling cancel that is also acquired from their send handler. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-03 14:12:54.671054304 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-03-03 14:13:04.375947697 -0800 @@ -68,6 +68,7 @@ static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); +static void cancel_sends(void *data); static void local_completions(void *data); static int solicited_mad(struct ib_mad *mad); static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, @@ -341,6 +342,8 @@ INIT_LIST_HEAD(&mad_agent_priv->local_list); INIT_WORK(&mad_agent_priv->local_work, local_completions, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->canceled_list); + INIT_WORK(&mad_agent_priv->canceled_work, cancel_sends, mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); @@ -2004,12 +2007,44 @@ return NULL; } +void cancel_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + 
spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->canceled_list)) { + mad_send_wr = list_entry(mad_agent_priv->canceled_list.next, + struct ib_mad_send_wr_private, + agent_list); + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + void ib_cancel_mad(struct ib_mad_agent *mad_agent, u64 wr_id) { struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; - struct ib_mad_send_wc mad_send_wc; unsigned long flags; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -2031,19 +2066,12 @@ } list_del(&mad_send_wr->agent_list); + list_add_tail(&mad_send_wr->agent_list, &mad_agent_priv->canceled_list); adjust_timeout(mad_agent_priv); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - mad_send_wc.status = IB_WC_WR_FLUSH_ERR; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = mad_send_wr->wr_id; - mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - &mad_send_wc); - - kfree(mad_send_wr); - if (atomic_dec_and_test(&mad_agent_priv->refcount)) - wake_up(&mad_agent_priv->wait); - + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->canceled_work); out: return; } --- linux-export.orig/drivers/infiniband/core/mad_priv.h 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/core/mad_priv.h 2005-03-03 14:13:04.375947697 -0800 @@ -95,6 +95,8 @@ unsigned long timeout; struct list_head local_list; struct work_struct local_work; + struct list_head canceled_list; + struct work_struct canceled_work; atomic_t refcount; wait_queue_head_t wait; From jgarzik at pobox.com 
Thu Mar 3 16:04:22 2005
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 03 Mar 2005 19:04:22 -0500
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com>
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com>
Message-ID: <4227A606.50703@pobox.com>

Roland Dreier wrote:
> Add a mthca_write_db_rec() to wrap writing doorbell records. On
> 64-bit archs, this is just a 64-bit write, while on 32-bit archs it
> splits the write into two 32-bit writes with a memory barrier to make
> sure the two halves of the record are written in the correct order.

> +static inline void mthca_write_db_rec(u32 val[2], u32 *db)
> +{
> +	db[0] = val[0];
> +	wmb();
> +	db[1] = val[1];
> +}
> +

Are you concerned about ordering, or write-combining?

I am unaware of a situation where writes are re-ordered into a reversed,
descending order for no apparent reason.

Jeff

From jgarzik at pobox.com Thu Mar 3 16:07:43 2005
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 03 Mar 2005 19:07:43 -0500
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks from thread
In-Reply-To: <2005331520.zA1xypugai2bUq7X@topspin.com>
References: <2005331520.zA1xypugai2bUq7X@topspin.com>
Message-ID: <4227A6CF.6080805@pobox.com>

Roland Dreier wrote:
> +void cancel_sends(void *data)
> +{
> +	struct ib_mad_agent_private *mad_agent_priv;
> +	struct ib_mad_send_wr_private *mad_send_wr;
> +	struct ib_mad_send_wc mad_send_wc;
> +	unsigned long flags;
> +
> +	mad_agent_priv = (struct ib_mad_agent_private *)data;

don't add casts to a void pointer, that's silly.
> +	mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
> +	mad_send_wc.vendor_err = 0;
> +
> +	spin_lock_irqsave(&mad_agent_priv->lock, flags);
> +	while (!list_empty(&mad_agent_priv->canceled_list)) {
> +		mad_send_wr = list_entry(mad_agent_priv->canceled_list.next,
> +					 struct ib_mad_send_wr_private,
> +					 agent_list);
> +
> +		list_del(&mad_send_wr->agent_list);
> +		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
> +
> +		mad_send_wc.wr_id = mad_send_wr->wr_id;
> +		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
> +						   &mad_send_wc);
> +
> +		kfree(mad_send_wr);
> +		if (atomic_dec_and_test(&mad_agent_priv->refcount))
> +			wake_up(&mad_agent_priv->wait);
> +		spin_lock_irqsave(&mad_agent_priv->lock, flags);
> +	}
> +	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);

dumb question... why is the lock dropped? is it just for the
send_handler(), or also for wr_id assigned, kfree, and wake_up() ?

From libor at topspin.com Thu Mar 3 16:21:07 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 3 Mar 2005 16:21:07 -0800
Subject: [openib-general] [RFC] userspace CM/verbs QP
Message-ID: <20050303162107.A18428@topspin.com>

Roland,

  As it currently stands, the userspace CM needs to pass a QP from
userspace to kernel space in order to pass it on to the kernel CM.
I'm thinking that the best way to handle this is for the uCM library
to pass uCM kernel the uverbs QP handle, and then have the kernel
uCM lookup the QP from ib_uverbs. Unfortunately this means ib_uverbs
would need to export a lookup function.

  Would you like a patch, or do you have some other idea?
-Libor

From roland at topspin.com Thu Mar 3 16:30:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:30:03 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks from thread
In-Reply-To: <4227A6CF.6080805@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:07:43 -0500")
References: <2005331520.zA1xypugai2bUq7X@topspin.com> <4227A6CF.6080805@pobox.com>
Message-ID: <52zmxknth0.fsf@topspin.com>

    Jeff> don't add casts to a void pointer, that's silly.

Fair enough...

    Jeff> dumb question... why is the lock dropped? is it just for
    Jeff> the send_handler(), or also for wr_id assigned, kfree, and
    Jeff> wake_up() ?

Not sure... Sean?

 - R.

From roland at topspin.com Thu Mar 3 16:33:15 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:33:15 -0800
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <4227A606.50703@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:04:22 -0500")
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com>
Message-ID: <52vf88ntbo.fsf@topspin.com>

    Jeff> Are you concerned about ordering, or write-combining?

ordering... write combining would be fine.

    Jeff> I am unaware of a situation where writes are re-ordered into
    Jeff> a reversed, descending order for no apparent reason.

Hmm... I've seen ppc64 do some pretty freaky reordering but on the
other hand that's a 64-bit arch so we don't care in this case. I
guess I'd rather keep the barrier there so we don't have the
possibility of a rare hardware crash when the HCA just happens to read
the doorbell record in a corrupt state.

 - R.
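
For readers following the doorbell thread, the 32-bit path under discussion can be sketched in plain userspace C. This is only an illustration under stated assumptions: sketch_wmb() and sketch_write_db_rec() are hypothetical names, and a C11 release fence stands in for the kernel's wmb(), which it does not fully replace for device-visible memory.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's wmb(). A C11 release
 * fence only orders stores as seen by other CPUs; a real driver needs
 * the kernel primitive so the device sees the same order. */
static inline void sketch_wmb(void)
{
	atomic_thread_fence(memory_order_release);
}

/* Write a two-word record so that the first half is visible before the
 * second, mirroring the shape of the quoted mthca_write_db_rec(). */
static inline void sketch_write_db_rec(const uint32_t val[2],
				       volatile uint32_t *db)
{
	db[0] = val[0];
	sketch_wmb();		/* first half must land before the second */
	db[1] = val[1];
}
```

Without the fence, nothing in the C language or in a weakly ordered CPU obliges the two stores to become visible in program order, which is the case Roland is guarding against.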
From sean.hefty at intel.com Thu Mar 3 16:34:43 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 Mar 2005 16:34:43 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <4227A6CF.6080805@pobox.com>
Message-ID:

>Roland Dreier wrote:
>> +void cancel_sends(void *data)
>> +{
>> +	struct ib_mad_agent_private *mad_agent_priv;
>> +	struct ib_mad_send_wr_private *mad_send_wr;
>> +	struct ib_mad_send_wc mad_send_wc;
>> +	unsigned long flags;
>> +
>> +	mad_agent_priv = (struct ib_mad_agent_private *)data;
>
>don't add casts to a void pointer, that's silly.

This is my bad.

>> +	mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
>> +	mad_send_wc.vendor_err = 0;
>> +
>> +	spin_lock_irqsave(&mad_agent_priv->lock, flags);
>> +	while (!list_empty(&mad_agent_priv->canceled_list)) {
>> +		mad_send_wr = list_entry(mad_agent_priv->canceled_list.next,
>> +					 struct ib_mad_send_wr_private,
>> +					 agent_list);
>> +
>> +		list_del(&mad_send_wr->agent_list);
>> +		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>> +
>> +		mad_send_wc.wr_id = mad_send_wr->wr_id;
>> +		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
>> +						   &mad_send_wc);
>> +
>> +		kfree(mad_send_wr);
>> +		if (atomic_dec_and_test(&mad_agent_priv->refcount))
>> +			wake_up(&mad_agent_priv->wait);
>> +		spin_lock_irqsave(&mad_agent_priv->lock, flags);
>> +	}
>> +	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>
>dumb question... why is the lock dropped? is it just for the
>send_handler(), or also for wr_id assigned, kfree, and wake_up() ?

The lock is dropped to avoid calling the user back with it held. The if
statement / wake_up call near the bottom of the loop can be replaced with
a simple atomic_dec. The test should always fail. The lock is to protect
access to the canceled_list.

(Sorry about the mailer...)
- Sean

From jgarzik at pobox.com Thu Mar 3 16:35:00 2005
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 03 Mar 2005 19:35:00 -0500
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <2005331520.cHJfJcRbBu1fFgB6@topspin.com>
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com>
Message-ID: <4227AD34.4050002@pobox.com>

Roland Dreier wrote:
> @@ -783,6 +777,11 @@
>  		      cq->cqn & (dev->limits.num_cqs - 1));
>  	spin_unlock_irq(&dev->cq_table.lock);
>
> +	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
> +		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
> +	else
> +		synchronize_irq(dev->pdev->irq);
> +

Tangent: I think we need a pci_irq_sync() rather than putting the above
code into each driver.

Jeff

From roland at topspin.com Thu Mar 3 16:37:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:37:45 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303162107.A18428@topspin.com> (Libor Michalek's message of "Thu, 3 Mar 2005 16:21:07 -0800")
References: <20050303162107.A18428@topspin.com>
Message-ID: <52r7iwnt46.fsf@topspin.com>

    Libor> Roland, As it currently stands, the userspace CM needs to
    Libor> pass a QP from userspace to kernel space in order to pass
    Libor> it on to the kernel CM. I'm thinking that the best way to
    Libor> handle this is for the uCM library to pass uCM kernel the
    Libor> uverbs QP handle, and then have the kernel uCM lookup the
    Libor> QP from ib_uverbs. Unfortunately this means ib_uverbs would
    Libor> need to export a lookup function.

    Libor> Would you like a patch, or do you have some other idea?

Hmm, I guess that would be OK. It does mean you have to hold a mutex
to avoid another userspace thread killing the QP out from under you,
which is a little ugly to expose...

What do you really need to do with the QP? Can you just have
userspace pass the information like QP number that you need?

 - R.
From sean.hefty at intel.com Thu Mar 3 16:39:02 2005
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 3 Mar 2005 16:39:02 -0800
Subject: [openib-general] [RFC] userspace CM/verbs QP
Message-ID:

> As it currently stands, the userspace CM needs to pass a QP from
>userspace to kernel space in order to pass it on to the kernel CM.
>I'm thinking that the best way to handle this is for the uCM library
>to pass uCM kernel the uverbs QP handle, and then have the kernel
>uCM lookup the QP from ib_uverbs. Unfortunately this means ib_uverbs
>would need to export a lookup function.
>
> Would you like a patch, or do you have some other idea?

As an FYI, the kernel CM uses the QP to get the QP number, QP type, if it
uses a SRQ, and which device the QP is located on.

- Sean

From roland at topspin.com Thu Mar 3 16:40:14 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:40:14 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <4227AD34.4050002@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:35:00 -0500")
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com>
Message-ID: <52mztknt01.fsf@topspin.com>

    > @@ -783,6 +777,11 @@
    >  		      cq->cqn & (dev->limits.num_cqs - 1));
    >  	spin_unlock_irq(&dev->cq_table.lock);
    > +	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
    > +		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
    > +	else
    > +		synchronize_irq(dev->pdev->irq);
    > +

    Jeff> Tangent: I think we need a pci_irq_sync() rather than
    Jeff> putting the above code into each driver.

The problem with trying to make it generic is that mthca has multiple
MSI-X vectors, and only the driver author could know that we only need
to synchronize with the completion event vector.

 - R.
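
The point about vector selection can be illustrated with a small standalone sketch. Everything here is hypothetical (struct sketch_hca, its fields, and sketch_completion_irq() are invented for illustration and are not mthca or PCI core interfaces); it only shows that the choice of which interrupt to wait on depends on driver-private knowledge.

```c
/* Hypothetical device description: with MSI-X the device has several
 * vectors, and only the driver knows which one delivers completions. */
struct sketch_hca {
	int use_msi_x;			/* nonzero if MSI-X is enabled */
	unsigned int comp_vector;	/* MSI-X vector for completion events */
	unsigned int legacy_irq;	/* single shared IRQ otherwise */
};

/* Pick the interrupt number a caller would have to synchronize with
 * before freeing a CQ. A generic helper could not make this choice
 * unless the driver handed it the vector. */
static unsigned int sketch_completion_irq(const struct sketch_hca *hca)
{
	return hca->use_msi_x ? hca->comp_vector : hca->legacy_irq;
}
```

A generic helper would therefore probably need to take the IRQ number (or a driver callback) as a parameter rather than derive it from the PCI device alone.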
From jgarzik at pobox.com Thu Mar 3 16:41:06 2005
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 03 Mar 2005 19:41:06 -0500
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <52vf88ntbo.fsf@topspin.com>
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com>
Message-ID: <4227AEA2.8060007@pobox.com>

Roland Dreier wrote:
>     Jeff> Are you concerned about ordering, or write-combining?
>
> ordering... write combining would be fine.
>
>     Jeff> I am unaware of a situation where writes are re-ordered into
>     Jeff> a reversed, descending order for no apparent reason.
>
> Hmm... I've seen ppc64 do some pretty freaky reordering but on the
> other hand that's a 64-bit arch so we don't care in this case. I
> guess I'd rather keep the barrier there so we don't have the
> possibility of a rare hardware crash when the HCA just happens to read
> the doorbell record in a corrupt state.

Well, we don't just add code to "hope and pray" for an event that nobody
is sure can even occur...

Does someone have a concrete case where this could happen? ever?

Jeff

From roland at topspin.com Thu Mar 3 16:43:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:43:26 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: (Sean Hefty's message of "Thu, 3 Mar 2005 16:34:43 -0800")
References:
Message-ID: <52fyzcnsup.fsf@topspin.com>

    >> don't add casts to a void pointer, that's silly.

How should we handle this nit? Should I post a new version of this
patch or an incremental diff that fixes it up?

 - R.
From roland at topspin.com Thu Mar 3 16:50:59 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:50:59 -0800
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <4227AEA2.8060007@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:41:06 -0500")
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com> <4227AEA2.8060007@pobox.com>
Message-ID: <52bra0nsi4.fsf@topspin.com>

    Jeff> Well, we don't just add code to "hope and pray" for an event
    Jeff> that nobody is sure can even occur...

The hardware requires that if the record is written in two 32-bit
chunks, then they must be written in order. Of course the hardware
probably won't be reading just as we're writing, so almost all of the
time we won't notice the problem.

It feels more like "hope and pray" to me to leave the barrier out and
assume that every possible implementation of every architecture will
always write them in order.

    Jeff> Does someone have a concrete case where this could happen? ever?

I don't see how you can rule it out on out-of-order architectures. If
the second word becomes ready before the first, then the CPU may
execute the second write before the first.

It's not precisely the same situation, but if you look at mthca_eq.c
you'll see an rmb() in mthca_eq_int(). That's there because on ppc64,
I really saw a situation where code like:

	while (foo->x) {
		switch (foo->y) {

was behaving as if foo->y was being read before foo->x. Even though
both foo->x and foo->y are in the same cache line, and foo->x was
written by the hardware after foo->y.

 - R.
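
The read-side case described above (check one field, then read another that was written earlier) can be sketched with C11 atomics. This is a hedged userspace analogue, not the mthca_eq.c code: struct sketch_entry, sketch_produce() and sketch_consume() are hypothetical, and the acquire fence only plays the role that rmb() plays in the kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical event entry: the producer fills in 'type' first and
 * sets 'valid' last, like foo->y and foo->x in the message above. */
struct sketch_entry {
	uint32_t type;
	atomic_uint valid;
};

/* Producer side: the payload is published before the valid flag. */
static void sketch_produce(struct sketch_entry *e, uint32_t type)
{
	e->type = type;
	atomic_store_explicit(&e->valid, 1, memory_order_release);
}

/* Consumer side: returns 1 and stores the payload in *type if an entry
 * was ready, otherwise returns 0. */
static int sketch_consume(struct sketch_entry *e, uint32_t *type)
{
	if (!atomic_load_explicit(&e->valid, memory_order_relaxed))
		return 0;
	/* Stand-in for rmb(): the read of 'type' below must not be
	 * satisfied before the check of 'valid' above. */
	atomic_thread_fence(memory_order_acquire);
	*type = e->type;
	atomic_store_explicit(&e->valid, 0, memory_order_relaxed);
	return 1;
}
```

Dropping the fence would leave the compiler and a weakly ordered CPU free to load type before valid, which matches the behaviour reported on ppc64.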
From libor at topspin.com Thu Mar 3 16:57:05 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 3 Mar 2005 16:57:05 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52r7iwnt46.fsf@topspin.com>; from roland@topspin.com on Thu, Mar 03, 2005 at 04:37:45PM -0800
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com>
Message-ID: <20050303165705.B18428@topspin.com>

On Thu, Mar 03, 2005 at 04:37:45PM -0800, Roland Dreier wrote:
>     Libor> Roland, As it currently stands, the userspace CM needs to
>     Libor> pass a QP from userspace to kernel space in order to pass
>     Libor> it on to the kernel CM. I'm thinking that the best way to
>     Libor> handle this is for the uCM library to pass uCM kernel the
>     Libor> uverbs QP handle, and then have the kernel uCM lookup the
>     Libor> QP from ib_uverbs. Unfortunately this means ib_uverbs would
>     Libor> need to export a lookup function.
>
>     Libor> Would you like a patch, or do you have some other idea?
>
> Hmm, I guess that would be OK. It does mean you have to hold a mutex
> to avoid another userspace thread killing the QP out from under you,
> which is a little ugly to expose...

  I was thinking of maybe ref counting the access, either in ib_qp or
ib_uobject. Adding a pair of functions, lookup and return, to manage
the ref count...

> What do you really need to do with the QP? Can you just have
> userspace pass the information like QP number that you need?
  I thought about that, here's the current list, but we would need to
lookup the pd anyway:

  qp->pd,
  qp->qp_num
  qp->qp_type
  qp->srq
  qp->device

-Libor

From greg at kroah.com Thu Mar 3 16:58:24 2005
From: greg at kroah.com (Greg KH)
Date: Thu, 3 Mar 2005 16:58:24 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <4227AD34.4050002@pobox.com>
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com>
Message-ID: <20050304005824.GA18411@kroah.com>

On Thu, Mar 03, 2005 at 07:35:00PM -0500, Jeff Garzik wrote:
> Roland Dreier wrote:
> >@@ -783,6 +777,11 @@
> > 		      cq->cqn & (dev->limits.num_cqs - 1));
> > 	spin_unlock_irq(&dev->cq_table.lock);
> >
> >+	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
> >+		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
> >+	else
> >+		synchronize_irq(dev->pdev->irq);
> >+
>
>
> Tangent: I think we need a pci_irq_sync() rather than putting the above
> code into each driver.

Sure, I have no problem accepting that into the pci core.

thanks,

greg k-h

From sean.hefty at intel.com Thu Mar 3 17:00:01 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 Mar 2005 17:00:01 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303165705.B18428@topspin.com>
Message-ID:

>I thought about that, here's the current list, but we would need to
>lookup the pd anyway:
>
> qp->pd,
> qp->qp_num
> qp->qp_type
> qp->srq
> qp->device

The kernel CM uses the PD of an internal mad_agent, and not the PD of the
user's QP. So, I don't think it's needed.
- Sean

From akpm at osdl.org Thu Mar 3 17:01:09 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 3 Mar 2005 17:01:09 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <52fyzcnsup.fsf@topspin.com>
References: <52fyzcnsup.fsf@topspin.com>
Message-ID: <20050303170109.72e8a3f2.akpm@osdl.org>

Roland Dreier wrote:
>
>     >> don't add casts to a void pointer, that's silly.
>
> How should we handle this nit? Should I post a new version of this
> patch or an incremental diff that fixes it up?
>

I'll fix it up.

From roland at topspin.com Thu Mar 3 17:02:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 17:02:36 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <20050304005824.GA18411@kroah.com> (Greg KH's message of "Thu, 3 Mar 2005 16:58:24 -0800")
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com> <20050304005824.GA18411@kroah.com>
Message-ID: <527jkonryr.fsf@topspin.com>

    Greg> Sure, I have no problem accepting that into the pci core.

What would pci_irq_sync() do exactly?

 - R.

From akpm at osdl.org Thu Mar 3 17:07:52 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 3 Mar 2005 17:07:52 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <20050303170109.72e8a3f2.akpm@osdl.org>
References: <52fyzcnsup.fsf@topspin.com> <20050303170109.72e8a3f2.akpm@osdl.org>
Message-ID: <20050303170752.7bc42e86.akpm@osdl.org>

Andrew Morton wrote:
>
> Roland Dreier wrote:
> >
> >     >> don't add casts to a void pointer, that's silly.
> >
> > How should we handle this nit? Should I post a new version of this
> > patch or an incremental diff that fixes it up?
> >
>
> I'll fix it up.

Actually, seeing as 15/26 has vanished into the ether and there have been
quite a few comments, please resend everything.
From akpm at osdl.org Thu Mar 3 17:22:12 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 3 Mar 2005 17:22:12 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <20050303170752.7bc42e86.akpm@osdl.org>
References: <52fyzcnsup.fsf@topspin.com> <20050303170109.72e8a3f2.akpm@osdl.org> <20050303170752.7bc42e86.akpm@osdl.org>
Message-ID: <20050303172212.27da9009.akpm@osdl.org>

Andrew Morton wrote:
>
> Andrew Morton wrote:
> >
> > Roland Dreier wrote:
> > >
> > >     >> don't add casts to a void pointer, that's silly.
> > >
> > > How should we handle this nit? Should I post a new version of this
> > > patch or an incremental diff that fixes it up?
> > >
> >
> > I'll fix it up.
>
> Actually, seeing as 15/26 has vanished into the ether and there have been
> quite a few comments, please resend everything.

I seem to have forgotten how to operate this computer thingy. I have all
26 patches.

From roland at topspin.com Thu Mar 3 18:02:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 18:02:20 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303165705.B18428@topspin.com> (Libor Michalek's message of "Thu, 3 Mar 2005 16:57:05 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com>
Message-ID: <52zmxkmamr.fsf@topspin.com>

    Libor> I was thinking of maybe ref counting the access, either in
    Libor> ib_qp or ib_uobject. Adding a pair of functions, lookup and
    Libor> return, to manage the ref count...

I think it makes sense to put a ref count in struct ib_uobject. That
would make it easier to enforce things like "make sure my CQs don't
get freed while I create this QP" also. Then I could encapsulate the
whole IDR locking mess.

 - R.
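
The lookup/return pairing proposed in this thread might look roughly like the following single-threaded sketch. All names are hypothetical (sketch_table, sketch_lookup(), sketch_return() and so on are not ib_uverbs interfaces), and a real kernel version would need a lock around the table plus an atomic or kref-style counter.

```c
#include <stddef.h>

#define SKETCH_MAX_OBJS 8

/* Hypothetical user object: a table slot plus a count of outstanding
 * lookups that have not yet been returned. */
struct sketch_uobject {
	int in_use;	/* slot holds a live object */
	int refcount;	/* lookups not yet returned */
};

static struct sketch_uobject sketch_table[SKETCH_MAX_OBJS];

/* Allocate a slot and return its handle, or -1 if the table is full. */
static int sketch_create(void)
{
	for (int i = 0; i < SKETCH_MAX_OBJS; i++) {
		if (!sketch_table[i].in_use) {
			sketch_table[i].in_use = 1;
			sketch_table[i].refcount = 0;
			return i;
		}
	}
	return -1;
}

/* "lookup": translate a handle to an object and take a reference. */
static struct sketch_uobject *sketch_lookup(int handle)
{
	if (handle < 0 || handle >= SKETCH_MAX_OBJS ||
	    !sketch_table[handle].in_use)
		return NULL;
	sketch_table[handle].refcount++;
	return &sketch_table[handle];
}

/* "return": drop the reference taken by sketch_lookup(). */
static void sketch_return(struct sketch_uobject *obj)
{
	obj->refcount--;
}

/* Destroy fails (-1) while any reference is outstanding, so an object
 * cannot vanish underneath a caller that has looked it up. */
static int sketch_destroy(int handle)
{
	if (handle < 0 || handle >= SKETCH_MAX_OBJS ||
	    !sketch_table[handle].in_use)
		return -1;
	if (sketch_table[handle].refcount > 0)
		return -1;
	sketch_table[handle].in_use = 0;
	return 0;
}
```

With this shape, a caller such as a userspace CM module would hold a reference only for the duration of its use of the object, instead of holding a table-wide mutex across the whole operation.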
From libor at topspin.com Thu Mar 3 18:21:21 2005 From: libor at topspin.com (Libor Michalek) Date: Thu, 3 Mar 2005 18:21:21 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <52zmxkmamr.fsf@topspin.com>; from roland@topspin.com on Thu, Mar 03, 2005 at 06:02:20PM -0800 References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> Message-ID: <20050303182121.C18428@topspin.com> On Thu, Mar 03, 2005 at 06:02:20PM -0800, Roland Dreier wrote: > Libor> I was thinking of maybe ref counting the access, either in > Libor> ib_qp or ib_uobject. Adding a pair of functions, lookup and > Libor> return, to manage the ref count... > > I think it makes sense to put a ref count in struct ib_uobject. That > would make it easier to enforce things like "make sure my CQs don't > get freed while I create this QP" also. Then I could encapsulate the > whole IDR locking mess. When you say locking mess, do you mean accessing and potentially deleting the object which is referenced by the IDR table outside of the lock used to access the IDR? -Libor From roland at topspin.com Thu Mar 3 18:59:21 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 03 Mar 2005 18:59:21 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <20050303182121.C18428@topspin.com> (Libor Michalek's message of "Thu, 3 Mar 2005 18:21:21 -0800") References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> Message-ID: <52vf88m7zq.fsf@topspin.com> Libor> When you say locking mess, do you mean accessing and Libor> potentially deleting the object which is referenced by the Libor> IDR table outside of the lock used to access the IDR? 
I just meant the fact that right now I have to hold the idr mutex over both looking up an old object (eg a PD) and creating a new object that uses the old object (eg an MR). It makes the cleanup paths and so on potentially tricky. - R. From Thomas.Talpey at netapp.com Thu Mar 3 16:46:50 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 03 Mar 2005 19:46:50 -0500 Subject: kDAPL code size Re: [openib-general] putting in dead wood for DAPL and similarabomination In-Reply-To: <6.2.1.2.2.20050303094748.03422620@exnane01.nane.netapp.com > References: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> <20050303034827.GA9092@lst.de> <6.2.1.2.2.20050303094748.03422620@exnane01.nane.netapp.com> Message-ID: <6.2.1.2.2.20050303193401.01bc9eb0@exnane01.nane.netapp.com> At 09:56 AM 3/3/2005, Talpey, Thomas wrote: >that. At present, the code is heavily commented and fully generalized to >aid porting to multiple operating systems. It will look quite different once >it is freed of these attributes. Also, I'll point out there is extensive debug >and trace throughout the code, which are optional. I did a quick check of the source and I can report that over half the lines of kDAPL are comments, taking 22KLOC to around 10KLOC. Debug and kDAPL/uDAPL ifdefs are another ~500, and the dapl_os_* portability glue ~2000. By the way, the NFS/RDMA client code is only 3KLOC. I could guess it would take another few KLOC if it had to interface directly to verbs. And that's just the NFS/RDMA client. Repeat for server, repeat for other upper layers such as iSER. Repeat all for iWARP. Ouch. Tom. From mst at mellanox.co.il Fri Mar 4 06:38:56 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 4 Mar 2005 16:38:56 +0200 Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing In-Reply-To: <52bra0nsi4.fsf@topspin.com> References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com> <4227AEA2.8060007@pobox.com> <52bra0nsi4.fsf@topspin.com> Message-ID: <20050304143855.GD13804@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing > > Jeff> Well, we don't just add code to "hope and pray" for an event > Jeff> that nobody is sure can even occur... > > The hardware requires that if the record is written in two 32-bit > chunks, then they must be written in order. Of course the hardware > probably won't be reading just as we're writing, so almost all of the > time we won't notice the problem. Its not necessarily related to reads. writes must arrive in order, even if the card is not reading at that time. -- MST - Michael S. Tsirkin From vonwyl at EIG.UNIGE.CH Fri Mar 4 06:46:20 2005 From: vonwyl at EIG.UNIGE.CH (Marc von Wyl) Date: Fri, 04 Mar 2005 15:46:20 +0100 Subject: [openib-general] Problem when trying to compile management and libibverb In-Reply-To: <20050222194001.GD25382@mellanox.co.il> References: <52u0o4pfe8.fsf@topspin.com> <20050222194001.GD25382@mellanox.co.il> Message-ID: <422874BC.2010903@eig.unige.ch> Hi, I get some troubles when trying to install the openib userspace gen2... When I try to compile the management part in gen2/trunk/src/userspace/management/libibcommon after the autogen and configure part I get an error with the Makefile : make[2]: rpath : command not found It seems that the LINK variable has no value (I tried with ld and ./libtool --mode=link gcc -g but still nothing...). And for the libibverb part, using autogen configure and make too, I get : make: *** No rule to make target src/libibverbs.la , necessary for all-am . Stop Thanks... 
From halr at voltaire.com Fri Mar 4 07:00:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Mar 2005 10:00:17 -0500 Subject: [openib-general] Problem when trying to compile management and libibverb In-Reply-To: <422874BC.2010903@eig.unige.ch> References: <52u0o4pfe8.fsf@topspin.com> <20050222194001.GD25382@mellanox.co.il> <422874BC.2010903@eig.unige.ch> Message-ID: <1109948416.4645.1546.camel@localhost.localdomain> On Fri, 2005-03-04 at 09:46, Marc von Wyl wrote: > I get some troubles when trying to install the openib userspace gen2... > > When I try to compile the management part in > gen2/trunk/src/userspace/management/libibcommon after the autogen and > configure part I get an error with the Makefile : > make[2]: rpath : command not found > It seems that the LINK variable has no value (I tried with ld and > ./libtool --mode=link gcc -g but still nothing...). > > And for the libibverb part, using autogen configure and make too, I get : > make: *** No rule to make target src/libibverbs.la , necessary for > all-am . Stop Not sure I totally follow what you did. I presume you followed the instructions in management/README and ran autogen.sh and configure in the library directories to generate your makefiles before running make. If so, not sure why LINK would not be defined. It gets generated in my Makefile as: LINK = $(LIBTOOL) --mode=link $(CCLD) $(AM_CFLAGS) $(CFLAGS) \ $(AM_LDFLAGS) $(LDFLAGS) -o $@ where CCLD = $(CC) Have you built this before or is this the first time ? What distribution are you using ? Can you send your Makefile which was generated for one of these libraries ? Thanks. 
-- Hal From vonwyl at EIG.UNIGE.CH Fri Mar 4 07:41:40 2005 From: vonwyl at EIG.UNIGE.CH (Marc von Wyl) Date: Fri, 04 Mar 2005 16:41:40 +0100 Subject: [openib-general] Problem when trying to compile management and libibverb In-Reply-To: <1109948416.4645.1546.camel@localhost.localdomain> References: <52u0o4pfe8.fsf@topspin.com> <20050222194001.GD25382@mellanox.co.il> <422874BC.2010903@eig.unige.ch> <1109948416.4645.1546.camel@localhost.localdomain> Message-ID: <422881B4.6070601@eig.unige.ch> Hal Rosenstock a écrit : >Not sure I totally follow what you did. I presume you followed the >instructions in management/README and ran autogen.sh and configure in >the library directories to generate your makefiles before running make. > >If so, not sure why LINK would not be defined. It gets generated in my >Makefile as: >LINK = $(LIBTOOL) --mode=link $(CCLD) $(AM_CFLAGS) $(CFLAGS) \ > $(AM_LDFLAGS) $(LDFLAGS) -o $@ >where >CCLD = $(CC) > >Have you built this before or is this the first time ? > >What distribution are you using ? > >Can you send your Makefile which was generated for one of these >libraries ? > >Thanks. > >-- Hal > > > I found the problem... I was looking in the wrong direction since two days... It was a problem with automake. Thanks and sorry for the disturbance. From greg at kroah.com Fri Mar 4 08:33:58 2005 From: greg at kroah.com (Greg KH) Date: Fri, 4 Mar 2005 08:33:58 -0800 Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1 In-Reply-To: <527jkonryr.fsf@topspin.com> References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com> <20050304005824.GA18411@kroah.com> <527jkonryr.fsf@topspin.com> Message-ID: <20050304163357.GB28179@kroah.com> On Thu, Mar 03, 2005 at 05:02:36PM -0800, Roland Dreier wrote: > Greg> Sure, I have no problem accepting that into the pci core. > > What would pci_irq_sync() do exactly? Consolidate common code like this? 
:) thanks, greg k-h From roland at topspin.com Fri Mar 4 08:34:50 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 04 Mar 2005 08:34:50 -0800 Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing In-Reply-To: <20050304143855.GD13804@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 4 Mar 2005 16:38:56 +0200") References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com> <4227AEA2.8060007@pobox.com> <52bra0nsi4.fsf@topspin.com> <20050304143855.GD13804@mellanox.co.il> Message-ID: <52acpjmkt1.fsf@topspin.com> Michael> Its not necessarily related to reads. writes must arrive Michael> in order, even if the card is not reading at that time. We're talking about doorbell records in host memory, not MMIO doorbell writes. So there's no way for the HCA to know that host memory was written out of order unless it happens to read in the middle. - R. From roland at topspin.com Fri Mar 4 08:43:06 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 04 Mar 2005 08:43:06 -0800 Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1 In-Reply-To: <20050304163357.GB28179@kroah.com> (Greg KH's message of "Fri, 4 Mar 2005 08:33:58 -0800") References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com> <20050304005824.GA18411@kroah.com> <527jkonryr.fsf@topspin.com> <20050304163357.GB28179@kroah.com> Message-ID: <521xavmkf9.fsf@topspin.com> Roland> What would pci_irq_sync() do exactly? Greg> Consolidate common code like this? :) I don't see how one can do that. As I pointed out in my reply to Jeff, it actually requires understanding how the driver uses the different MSI-X vectors to know which vector we need to synchronize against. So it seems pci_irq_sync() would have to be psychic. If we can figure out how to do that, maybe we can consolidate a lot more code into an API like void do_what_i_mean(void); ;) - R. 
From roland at topspin.com Fri Mar 4 08:58:41 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 08:58:41 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52vf88m7zq.fsf@topspin.com> (Roland Dreier's message of "Thu, 03 Mar 2005 18:59:21 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com>
Message-ID: <52sm3bl54u.fsf@topspin.com>

I thought about this a little more. There is a problem with letting
the CM module look up QPs in the userspace verbs table: it becomes
very awkward to check that the QP belongs to a context (==
userspace verbs file descriptor) owned by the CM user.

I see the following solutions:

 1. Don't worry about checking. There's nothing too evil a CM user
    can do with a QP beyond getting another QP to connect to it, since
    the CM user can't modify a QP unless it legitimately owns it. And
    an evil user can always guess the QPN instead of the QP handle anyway.

 2. Change the CM API so that it just takes the QPN, QP type, SRQ
    status and device directly rather than reading it out of the QP.
    This lets the userspace CM just get the info from userspace
    without needing to look at the QP at all. Of course it does raise
    the issue of how userspace should specify the device.

 3. Merge the userspace CM into userspace verbs support so they use
    the same context. Ugh.

Personally I would lean slightly towards #2, since it feels to me like
even the kernel CM API would be cleaner that way. However I don't
have a good answer for how userspace should specify which device to
use.

 - R.
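As a rough illustration of option 2 above, here is one shape the CM parameters could take. This is a sketch under assumptions, not the CM API: struct cm_qp_info, cm_qp_info_from_qp() and the stand-in device/qp types are hypothetical names invented for the example. It shows the four pieces of information named in the message being passed directly, with a helper for kernel callers that still have a QP pointer in hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's device and QP structures. */
enum qp_type { QPT_RC, QPT_UC, QPT_UD };

struct device {
	const char *name;
};

struct qp {
	struct device *device;
	uint32_t qp_num;	/* QP numbers are 24 bits wide */
	enum qp_type type;
	int has_srq;
};

/* Option 2: the CM is handed exactly the fields it needs, so it never
 * has to look up or dereference the QP itself. */
struct cm_qp_info {
	struct device *device;
	uint32_t qp_num;
	enum qp_type qp_type;
	int srq;
};

/* Convenience for kernel consumers that do hold a QP pointer. */
void cm_qp_info_from_qp(struct cm_qp_info *info, const struct qp *qp)
{
	assert((qp->qp_num & ~0xFFFFFFu) == 0);	/* must fit in 24 bits */
	info->device = qp->device;
	info->qp_num = qp->qp_num;
	info->qp_type = qp->type;
	info->srq = qp->has_srq;
}
```

A userspace CM would fill the same structure from values supplied by the application; the open question from the message, how userspace names the device, is left unanswered here as well.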
From libor at topspin.com Fri Mar 4 10:08:38 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 4 Mar 2005 10:08:38 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52sm3bl54u.fsf@topspin.com>; from roland@topspin.com on Fri, Mar 04, 2005 at 08:58:41AM -0800
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com>
Message-ID: <20050304100838.A19903@topspin.com>

On Fri, Mar 04, 2005 at 08:58:41AM -0800, Roland Dreier wrote:
> I thought about this a little more. There is a problem with letting
> the CM module look up QPs in the userspace verbs table: it becomes
> very awkward to check that the QP belongs to a context (==
> userspace verbs file descriptor) owned by the CM user.
>
> I see the following solutions:
>
> 1. Don't worry about checking. There's nothing too evil a CM user
> can do with a QP beyond getting another QP to connect to it, since
> the CM user can't modify a QP unless it legitimately owns it. And
> an evil user can always guess the QPN instead of the QP handle anyway.
>
> 2. Change the CM API so that it just takes the QPN, QP type, SRQ
> status and device directly rather than reading it out of the QP.
> This lets the userspace CM just get the info from userspace
> without needing to look at the QP at all. Of course it does raise
> the issue of how userspace should specify the device.

  As you say, the second solution does not resolve the issue about which
you are worried in the first solution. The device issue I think still
creates a dependency between the kernel components of uverbs and ucm,
it's just moved down the chain, since the only thing that has the user
to kernel mapping of the device handle is uverbs. Unless that code is
duplicated, and then it becomes a software maintenance dependency...

> Personally I would lean slightly towards #2, since it feels to me like
> even the kernel CM API would be cleaner that way. However I don't
> have a good answer for how userspace should specify which device to
> use.

  I'm still leaning towards #1. If it comes down to a choice between
device or QP that needs to be exposed, the QP seems more intuitive to me.

-Libor

From roland at topspin.com Fri Mar 4 10:27:39 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 10:27:39 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050304100838.A19903@topspin.com> (Libor Michalek's message of "Fri, 4 Mar 2005 10:08:38 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <20050304100838.A19903@topspin.com>
Message-ID: <52k6onl10k.fsf@topspin.com>

    Libor> As you say, the second solution does not resolve the
    Libor> issue about which you are worried in the first
    Libor> solution. The device issue I think still creates a
    Libor> dependency between the kernel components of uverbs and ucm,
    Libor> it's just moved down the chain, since the only thing that
    Libor> has the user to kernel mapping of the device handle is
    Libor> uverbs. Unless that code is duplicated, and then it becomes
    Libor> a software maintenance dependency...

Yeah, you're right. OK, let's just export the QP handle
lookup/release stuff from uverbs.

 - R.
From mshefty at ichips.intel.com Fri Mar 4 10:50:16 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Mar 2005 10:50:16 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <20050304100838.A19903@topspin.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <20050304100838.A19903@topspin.com> Message-ID: <4228ADE8.70109@ichips.intel.com> Libor Michalek wrote: >> 2. Change the CM API so that it just takes the QPN, QP type, SRQ >> status and device directly rather than reading it out of the QP. >> This lets the userspace CM just get the info from userspace >> without needing to look at the QP at all. Of course it does raise >> the issue of how userspace should specify the device. > > As you say, the second solution does not resolve the issue about which > you are worried in the first solution. The device issue I think still > creates a dependancy between the kernel components of uverbs and ucm, > it's just moved down the chain, since the only thing that has the user > to kernel mapping of the device handle is uverbs. Unless that code is > duplicated, and then it becomes a software maintenance dependency... The CM needs the device in order to locate which port to send out the connection REQ on. We could let the CM locate the device in the kernel based on the user's path record. This goes back a little to the discussion of having a cm_path field in the REQ parameter that the CM can use when sending the REQ. On the receiving side of the REQ, the CM knows the device based on which port the REQ came in on. In order to make this work, there needs to be a call similar to: ib_find_cached_device_gid(gid, &device, &port_num, &index); The CM already needs something like this call in order to perform SIDR. (See cm_find_device() in cm.c.) 
Support for this call requires some changes/exposure to the known device list.

So, I think that we may be able to change the kernel CM API to take the
necessary fields, rather than the QP pointer. I can make doing these
changes a priority over RMPP if we decide to go this route.

- Sean

From tduffy at sun.com Fri Mar 4 11:58:33 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 04 Mar 2005 11:58:33 -0800
Subject: [openib-general] IB Address Translation service
In-Reply-To:
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman>
Message-ID: <1109966313.20238.11.camel@duffman>

On Wed, 2005-03-02 at 11:26 -0500, James Lentini wrote:
> tduffy> > The one thing that ATS provide and is not possible with
> tduffy> > ARP is reverse resolution GID->IP, any ideas how to achieve
> tduffy> > that without ATS ?
> tduffy>
> tduffy> RARP.
>
> Where is the encapsulation of RARP packets on IB defined? The
> "Transmission of IP over InfiniBand" IETF draft specifies the
> procedure for ARP and Neighbor Discovery, but not RARP.

I do see some mention of RARP in the ipoib IETF draft, but it may not be
fully fleshed out.

In any event, I think being able to plop an IB network in an Ethernet
world will require things like RARP to work. If there is no spec now,
it should be written.

-tduffy

From halr at voltaire.com Fri Mar 4 12:53:22 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Mar 2005 15:53:22 -0500
Subject: [Fwd: Re: [openib-general] Solaris IPoIB MTU with OpenSM]
Message-ID: <1109969601.4648.32.camel@erez-s.us.voltaire.com>

Hi again Nitin,

Finally got a chance to work on this. I have a workaround for you for
now. Real patch later... Let me know if this does the trick for you. It
did for me.
-- Hal Index: osm_sa_mcmember_record.c =================================================================== --- osm_sa_mcmember_record.c (revision 1953) +++ osm_sa_mcmember_record.c (working copy) @@ -1522,9 +1522,11 @@ if ((IB_MCR_COMPMASK_PROXY & comp_mask) && (p_rcvd_rec->proxy_join != p_mgrp->mcmember_rec.proxy_join)) goto Exit; +#if 0 /* if defined MUST match exactly !*/ if ((IB_MCR_COMPMASK_MTU_SEL & comp_mask) && ((p_rcvd_rec->mtu >> 6) != (p_mgrp->mcmember_rec.mtu >> 6))) goto Exit; +#endif if ((IB_MCR_COMPMASK_MTU & comp_mask) && ((p_rcvd_rec->mtu & 0x3F) != (p_mgrp->mcmember_rec.mtu & 0x3F))) goto Exit; -----Forwarded Message----- From: Hal Rosenstock To: Nitin Hande Cc: openib , Tom Duffy Subject: Re: [openib-general] Solaris IPoIB MTU with OpenSM Date: 24 Feb 2005 08:42:23 -0500 Hi Nitin, On Wed, 2005-02-23 at 17:19, Nitin Hande wrote: > Hal, > > [comments below] > On Wed, 2005-02-23 at 02:19, Hal Rosenstock wrote: > > On Tue, 2005-02-22 at 22:56, Nitin Hande wrote: > > > So I tried the latest patches and preliminarily things seem to be > > > working fine. > > > > Yipee. > [snip..] > > > > > > > > So after this test above, I try to run snoop on the solaris interface > > > and get the following error message from the layer below IPoIB: > > > > > > Feb 22 19:50:25 dongon.SFBay.Sun.COM ibd: [ID 517869 kern.info] NOTICE: > > > ibd0: HCA GUID 0002c901097651d0 port 1 PKEY ffff Could not get list of > > > IBA multicast groups > > > > > > My preliminary assumption is that OpenSm is not returning the list of > > > multicast groups that the ibd interface has joined. I will look at the > > > MAD's tomorrow and try to ascertain that. > > > > How does S10 request this ? Remember that if it is a GetTable and > > doesn't fit in a single MAD, it will be broken now. If that is the case, > > we will live with this until we have real RMPP. > Below is an an example of a single GetTable request and response between > Solaris and OpenSM. 
OpenSM is not reporting the MCgroups in case of a > single request/response. I have also provided a MAD output between > Solaris IPoIB driver and IBSRM single GetTable request response below > this example. > > Here is the MAD trace between solaris and OpenSM: > Outgoing MAD: > BaseVersion: 0x1 > MgmtClass: 0x3 - SubnAdm > ClassVersion: 0x2 > R_Method: 0x12 - SubnAdmGetTable() > Status: 0x0 - NO_ERROR > ClassSpecific: 0x0 > TransactionID: 0x97651d1000000ec > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > 0: 01 03 02 12 00 00 00 00 09 76 51 d1 00 00 00 ec .........vQ..... > 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 .8.............. > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 50: 00 00 00 00 00 00 00 00 00 00 0b 1b 00 00 84 00 ................ > 60: ff ff 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> Incoming MAD: > BaseVersion: 0x1 > MgmtClass: 0x3 - SubnAdm > ClassVersion: 0x2 > R_Method: 0x92 - > Status: 0x0 - NO_ERROR > ClassSpecific: 0x0 > TransactionID: 0x97651d1000000ec > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > 0: 01 03 02 92 00 00 00 00 09 76 51 d1 00 00 00 ec .........vQ..... > 10: 00 38 00 00 ff ff ff ff 01 01 77 00 00 00 00 01 .8........w..... > 20: 00 00 00 14 00 00 00 00 00 00 00 00 00 07 00 00 ................ > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ It is likely failing the component checking in osm_sa_mcmember_record.c::__osm_sa_mcm_by_comp_mask_cb due to an endian issue. Either you can debug this code or I will early next week. The component mask in the request is 0x80b4 so the only components checked are QKey (0xb1b), MTU (exactly 2048 (4)), PKey (0xffff), and scope (2). If I don't hear anything by next week, I will work on this then. Thanks. -- Hal > Here is the transaction between IBSRM and Solaris IPoIB driver. 
>
> Outgoing MAD:
> BaseVersion: 0x1
> MgmtClass: 0x3 - SubnAdm
> ClassVersion: 0x2
> R_Method: 0x12 - SubnAdmGetTable()
> Status: 0x0 - NO_ERROR
> ClassSpecific: 0x0
> TransactionID: 0x8fecc610000009a
> AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
>
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
>  0: 01 03 02 12 00 00 00 00 08 fe cc 61 00 00 00 9a  ...........a....
> 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00  .8..............
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00  ................
> 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 50: 00 00 00 00 00 00 00 00 81 23 45 68 00 00 84 00  .........#Eh....
> 60: 80 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Incoming MAD:
> BaseVersion: 0x1
> MgmtClass: 0x3 - SubnAdm
> ClassVersion: 0x2
> R_Method: 0x92 -
> Status: 0x0 - NO_ERROR
> ClassSpecific: 0x0
> TransactionID: 0x8fecc610000009a
> AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
>
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
>  0: 01 03 02 92 00 00 00 00 08 fe cc 61 00 00 00 9a  ...........a....
> 10: 00 38 00 00 00 00 00 00 01 01 73 00 00 00 00 01  .8........s.....
> 20: 00 00 01 40 00 00 00 00 00 00 00 00 00 07 00 00  ...@............
> 30: 00 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00  ..........@.....
> 40: 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 00  ................
> 50: 00 00 00 00 00 00 00 00 81 23 45 68 c0 04 84 00  .........#Eh....
> 60: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> 70: ff 12 40 1b 80 01 00 00 00 00 00 00 00 00 00 01  ..@.............
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 90: 81 23 45 68 c0 03 84 00 80 01 83 8d 00 00 00 00  .#Eh............
> a0: 20 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00   .........@.....
> b0: 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00  ................
> c0: 00 00 00 00 00 00 00 00 81 23 45 68 c0 00 84 00  .........#Eh....
> d0: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> e0: ff 12 60 1b 80 01 00 00 00 00 00 01 ff 76 5b 01  ..`..........v[.
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>
> Thanks
> Nitin

From Thomas.Talpey at netapp.com Fri Mar 4 13:06:01 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 04 Mar 2005 16:06:01 -0500
Subject: [openib-general] IB Address Translation service
In-Reply-To: <1109966313.20238.11.camel@duffman>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman>
Message-ID: <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com>

At 02:58 PM 3/4/2005, Tom Duffy wrote:
>In any event, I think being able to plop an IB network in an Ethernet
>world will require things like RARP to work. If there is no spec now,
>it should be written.

I can't remember the last time I saw a machine RARP. Well, maybe
I do but it was like 1980-something. Since DHCP, I don't think
there's a reason for hosts to do it.

Here's what comes out when I type "man rarp" on a 2.6 system.
"Obsolete".

>RARP(8)             Linux Programmer's Manual             RARP(8)
>
>
>
>NAME
>       rarp - manipulate the system RARP table
>
>SYNOPSIS
>       rarp [-V] [--version] [-h] [--help]
>       rarp -a
>       rarp [-v] -d hostname ...
> rarp [-v] [-t type] -s hostname hw_addr > >NOTE > This program is obsolete. From version 2.3, the Linux > kernel no longer contains RARP support. For a replacement > RARP daemon, see ftp://ftp.dementia.org/pub/net-tools Tom. From mshefty at ichips.intel.com Fri Mar 4 16:09:14 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Mar 2005 16:09:14 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B419.1080601@ichips.intel.com> References: <4224B419.1080601@ichips.intel.com> Message-ID: <4228F8AA.2060102@ichips.intel.com> Sean Hefty wrote: > I'm studying the RMPP implementation requirements for reassembly, and > there are a couple of issues/questions. * In order to send RMPP ACKs, etc. the RMPP code needs access to a MR (LKey actually) usable with the registered mad_agent. Both the CM and SA query code call ib_get_dma_mr() after calling ib_register_mad_agent(), and I would expect that other code will be similar. I was considering adding an ib_mr* field to the mad_agent structure and returning it to the user. Any objections or comments? - Sean From roland at topspin.com Fri Mar 4 16:31:06 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 04 Mar 2005 16:31:06 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4228F8AA.2060102@ichips.intel.com> (Sean Hefty's message of "Fri, 04 Mar 2005 16:09:14 -0800") References: <4224B419.1080601@ichips.intel.com> <4228F8AA.2060102@ichips.intel.com> Message-ID: <52y8d3j5md.fsf@topspin.com> Sean> * In order to send RMPP ACKs, etc. the RMPP code needs Sean> access to a MR (LKey actually) usable with the registered Sean> mad_agent. Both the CM and SA query code call Sean> ib_get_dma_mr() after calling ib_register_mad_agent(), and I Sean> would expect that other code will be similar. I was Sean> considering adding an ib_mr* field to the mad_agent Sean> structure and returning it to the user. Any objections or Sean> comments? 
We discussed this once back in September of last year. For some reason we decided that the consumer was responsible for managing memory registration, but I don't remember why. - R. From mshefty at ichips.intel.com Fri Mar 4 16:46:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Mar 2005 16:46:21 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <52y8d3j5md.fsf@topspin.com> References: <4224B419.1080601@ichips.intel.com> <4228F8AA.2060102@ichips.intel.com> <52y8d3j5md.fsf@topspin.com> Message-ID: <4229015D.6000005@ichips.intel.com> Roland Dreier wrote: > Sean> * In order to send RMPP ACKs, etc. the RMPP code needs > Sean> access to a MR (LKey actually) usable with the registered > Sean> mad_agent. > > We discussed this once back in September of last year. For some > reason we decided that the consumer was responsible for managing > memory registration, but I don't remember why. I can't remember either, but I don't know if we thought about internally generated MADs sent on behalf of the user. I will continue to try to find that discussion. At a minimum, the RMPP code either needs the user to provide a MR that it can use when registering for the MAD service, or it needs to allocate one internally. If the latter approach is used, exposing it to the user seems to make sense. For QP 0/1 traffic, the RMPP layer could cheat a little and allocate a single MR per port, rather than per mad_agent, similar to what the CM and SA module do. But then a different method would be needed if we ever wanted to support RMPP on a redirected QP. 
- Sean
From beng at isilon.com Fri Mar 4 17:01:35 2005
From: beng at isilon.com (Brian Eng)
Date: Fri, 04 Mar 2005 17:01:35 -0800
Subject: [openib-general] Re: Incorrect endian in GUID comparison/SM master selection
In-Reply-To: <1108607846.27002.113.camel@bengbsd.isilon.com>
References: <1108607846.27002.113.camel@bengbsd.isilon.com>
Message-ID: <1109984495.62825.4849.camel@bengbsd.isilon.com>

Hello again,

I found another place in OpenSM where it compares two SM's. I suggest
the following both to fix it and to form a common comparison routine:

--- osm_sminfo_rcv.c 4 Mar 2005 23:21:53 -0000 1.2.2.4
+++ osm_sminfo_rcv.c 5 Mar 2005 00:45:03 -0000
@@ -156,30 +156,15 @@ osm_sminfo_rcv_init(
 By higher - we mean: SM with higher priority or with same priority and lower GUID.
 **********************************************************************/
-boolean_t
+inline boolean_t
 __osm_sminfo_rcv_remote_sm_is_higher (
   IN const osm_sminfo_rcv_t* p_rcv,
   IN const ib_sm_info_t* p_remote_sm )
 {
-
-
-  if( ib_sminfo_get_priority( p_remote_sm ) >
-      p_rcv->p_subn->opt.sm_priority )
-  {
-    return( TRUE );
-  }
-  else
-  {
-    if( ib_sminfo_get_priority( p_remote_sm ) ==
-        p_rcv->p_subn->opt.sm_priority )
-    {
-      if( p_remote_sm->guid < p_rcv->p_subn->sm_port_guid )
-      {
-        return( TRUE );
-      }
-    }
-  }
-  return( FALSE );
+  return( osm_sm_is_greater_than( ib_sminfo_get_priority( p_remote_sm ),
+                                  p_remote_sm->guid,
+                                  p_rcv->p_subn->opt.sm_priority,
+                                  p_rcv->p_subn->sm_port_guid) );
 }
 
 /**********************************************************************
--- osm_state_mgr.c 4 Mar 2005 23:15:42 -0000 1.2.2.6
+++ osm_state_mgr.c 5 Mar 2005 00:45:03 -0000
@@ -1563,12 +1563,19 @@ __osm_state_mgr_get_highest_sm(
 {
   cl_qmap_t* p_sm_tbl;
   osm_remote_sm_t* p_sm = NULL;
-  osm_remote_sm_t* p_highest_sm = NULL;
+  osm_remote_sm_t* p_highest_sm;
+  uint8_t highest_sm_priority;
+  ib_net64_t highest_sm_guid;
 
   OSM_LOG_ENTER( p_mgr->p_log, __osm_state_mgr_get_highest_sm );
 
   p_sm_tbl = &p_mgr->p_subn->sm_guid_tbl;
 
+  /* Start with the local sm as the standard */
+  p_highest_sm = NULL;
+  highest_sm_priority = p_mgr->p_subn->opt.sm_priority;
+  highest_sm_guid = p_mgr->p_subn->sm_port_guid;
+
   /* go over all the remote SMs */
   for( p_sm = (osm_remote_sm_t*)cl_qmap_head( p_sm_tbl );
        p_sm != (osm_remote_sm_t*)cl_qmap_end( p_sm_tbl );
@@ -1579,55 +1586,19 @@ __osm_state_mgr_get_highest_sm(
     if (ib_sminfo_get_state(&p_sm->smi) == IB_SMINFO_STATE_NOTACTIVE )
       continue;
 
-    if ( p_highest_sm == NULL)
+    if ( osm_sm_is_greater_than( ib_sminfo_get_priority(&p_sm->smi),
+         p_sm->smi.guid, highest_sm_priority, highest_sm_guid ) )
     {
+      /* the new p_sm is with higher priority - update the highest_sm */
+      /* to this sm */
       p_highest_sm = p_sm;
-    }
-    else
-    {
-      if ( ib_sminfo_get_priority(&p_sm->smi) >
-           ib_sminfo_get_priority(&p_highest_sm->smi) )
-      {
-        /* the new p_sm is with higher priority - update the highest_sm */
-        /* to this sm */
-        p_highest_sm = p_sm;
-      }
-      else
-      {
-        if ( ib_sminfo_get_priority(&p_sm->smi) ==
-             ib_sminfo_get_priority(&p_highest_sm->smi) )
-        {
-          /* both SMs are with same priority - compare GUIDs */
-          if ( p_sm->smi.guid < p_highest_sm->smi.guid )
-          {
-            /* the new p_sm is with same priority but lower GUID - */
-            /* update the highest sm to this sm */
-            p_highest_sm = p_sm;
-          }
-        }
-      }
+      highest_sm_priority = ib_sminfo_get_priority(&p_sm->smi);
+      highest_sm_guid = p_sm->smi.guid;
     }
   }
-  /* compare the p_highest_sm to the local sm */
+
   if ( p_highest_sm != NULL )
   {
-    /* check if this SM is higher then us */
-    if ( p_mgr->p_subn->opt.sm_priority >
-         ib_sminfo_get_priority(&p_highest_sm->smi) )
-    {
-      /* the local SM has higher priority */
-      return( NULL );
-    }
-    else
-    {
-      if( ib_sminfo_get_priority(&p_highest_sm->smi) ==
-          p_mgr->p_subn->opt.sm_priority &&
-          p_highest_sm->smi.guid > p_mgr->p_subn->sm_port_guid )
-      {
-        /* they have same priority. Local SM has lower GUID */
-        return( NULL );
-      }
-    }
     osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
              "__osm_state_mgr_get_highest_sm: "
              "Found higher SM with guid: %016" PRIx64 "\n",
--- osm_state_mgr.h 23 Sep 2004 18:43:08 -0000 1.2
+++ osm_state_mgr.h 5 Mar 2005 00:45:03 -0000
@@ -495,6 +495,62 @@ osm_state_mgr_process(
 * SEE ALSO
 *   State Manager
 *********/
+/****f* OpenSM: State Manager/osm_sm_is_greater_than
+* NAME
+*   osm_sm_is_greater_than
+*
+* DESCRIPTION
+*   Compares two SM's (14.4.1.2)
+*
+* SYNOPSIS
+*/
+static inline boolean_t
+osm_sm_is_greater_than (
+  IN const uint8_t l_priority,
+  IN const ib_net64_t l_guid,
+  IN const uint8_t r_priority,
+  IN const ib_net64_t r_guid )
+{
+  if( l_priority > r_priority )
+  {
+    return( TRUE );
+  }
+  else
+  {
+    if( l_priority == r_priority )
+    {
+      if( cl_ntoh64(l_guid) < cl_ntoh64(r_guid) )
+      {
+        return( TRUE );
+      }
+    }
+  }
+  return( FALSE );
+}
+/*
+* PARAMETERS
+*   l_priority
+*     [in] Priority of the SM on the "left"
+*
+*   l_guid
+*     [in] GUID of the SM on the "left"
+*
+*   r_priority
+*     [in] Priority of the SM on the "right"
+*
+*   r_guid
+*     [in] GUID of the SM on the "right"
+*
+* RETURN VALUES
+*   Return TRUE if an sm with l_priority and l_guid is higher than an sm
+*   with r_priority and r_guid,
+*   return FALSE otherwise.
+*
+* NOTES
+*
+* SEE ALSO
+*   State Manager
+*********/
From lindahl at pathscale.com Fri Mar 4 18:04:03 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 4 Mar 2005 18:04:03 -0800
Subject: [openib-general] IB Address Translation service
In-Reply-To: <1109966313.20238.11.camel@duffman>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
 <1109715208.11800.41.camel@duffman>
 <1109966313.20238.11.camel@duffman>
Message-ID: <20050305020402.GA3297@greglaptop.internal.keyresearch.com>

On Fri, Mar 04, 2005 at 11:58:33AM -0800, Tom Duffy wrote:

> In any event, I think being able to plop an IB network in an Ethernet
> world will require things like RARP to work.
If there is no spec now, > it should be written. Much more important is understanding the role of RARP in the ethernet world. It is *not* something you do to find _someone else's_ IP addr from their MAC addr. It's what you do to find your _own_ IP addr because you're booting. Ethernet protocols such as IP include enough IP information to talk back to someone who sent you a packet. So you don't need to find out an IP addr from a MAC for remote nodes on a regular basis. Instead, you find out a MAC addr from an IP address, which is ARP. RARP is little used now that DHCP is popular. Now it would be nice for ethernet broadcast packets to just work(tm) with IPoIB. "ping -b" is an example of a user-level program that generates a broadcast packet. DHCP clients also generate such packets, and DHCP servers listen for them. Getting a RARP client and server to work ought to be the same as a DHCP client and server. -- greg From tduffy at sun.com Fri Mar 4 18:30:48 2005 From: tduffy at sun.com (Tom Duffy) Date: Fri, 04 Mar 2005 18:30:48 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> Message-ID: <1109989848.20238.16.camel@duffman> On Fri, 2005-03-04 at 16:06 -0500, Talpey, Thomas wrote: > At 02:58 PM 3/4/2005, Tom Duffy wrote: > >In any event, I think being able to plop an IB network in an Ethernet > >world will require things like RARP to work. If there is no spec now, > >it should be written. > > I can't remember the last time I saw a machine RARP. Well, maybe I do > but it was like 1980-something. Since DHCP, I don't think there's a > reason for hosts to do it. I guess Sun is stuck in the 80's -- big hair and new age music. All the sparc openboot systems rarp/bootp to network start. 
-tduffy
From cap at nsc.liu.se Sat Mar 5 02:24:18 2005
From: cap at nsc.liu.se (Peter Kjellström)
Date: Sat, 5 Mar 2005 11:24:18 +0100
Subject: [openib-general] IB Address Translation service
In-Reply-To: <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
 <1109966313.20238.11.camel@duffman>
 <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com>
Message-ID: <200503051124.19440.cap@nsc.liu.se>

On Friday 04 March 2005 22.06, Talpey, Thomas wrote:
> At 02:58 PM 3/4/2005, Tom Duffy wrote:
> >In any event, I think being able to plop an IB network in an Ethernet
> >world will require things like RARP to work. If there is no spec now,
> >it should be written.
>
> I can't remember the last time I saw a machine RARP. Well, maybe I do
> but it was like 1980-something. Since DHCP, I don't think there's a
> reason for hosts to do it.

If my memory serves me right, clustermatic uses rarp instead of dhcp
since it's much lighter and simpler. I can imagine that if you have to
code something that's not in a full OS, implementing a rarp-based
find-my-ip function will seem a lot more fun than implementing a
dhcp-client (or porting one...).

/Peter

--
------------------------------------------------------------
Peter Kjellström | E-mail: cap at nsc.liu.se
National Supercomputer Centre |
Sweden | http://www.nsc.liu.se
From David.Brean at Sun.COM Sat Mar 5 07:22:08 2005
From: David.Brean at Sun.COM (David M.
Brean) Date: Sat, 05 Mar 2005 10:22:08 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <20050305020402.GA3297@greglaptop.internal.keyresearch.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> Message-ID: <4229CEA0.7060904@sun.com> Greg Lindahl wrote: >On Fri, Mar 04, 2005 at 11:58:33AM -0800, Tom Duffy wrote: > > > >>In any event, I think being able to plop an IB network in an Ethernet >>world will require things like RARP to work. If there is no spec now, >>it should be written. >> >> > >Much more important is understanding the role of RARP in the ethernet >world. > >It is *not* something you do to find _someone else's_ IP addr from >their MAC addr. It's what you do to find your _own_ IP addr because >you're booting. Ethernet protocols such as IP include enough IP >information to talk back to someone who sent you a packet. So you >don't need to find out an IP addr from a MAC for remote nodes on a >regular basis. Instead, you find out a MAC addr from an IP address, >which is ARP. > > > Right, RARP won't satisfy the reverse lookup requirement being put forward, so I don't think it's relevant to this address resolution discussion. >RARP is little used now that DHCP is popular. > >Now it would be nice for ethernet broadcast packets to just work(tm) >with IPoIB. "ping -b" is an example of a user-level program that >generates a broadcast packet. DHCP clients also generate such >packets, and DHCP servers listen for them. Getting a RARP client and >server to work ought to be the same as a DHCP client and server. > > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, but as someone else pointed out, the IB hardware address contains a QPN. 
The I-D for IPoIB says something like: The link-layer address for IPoIB includes the QPN which might not be constant across reboots or even across network interface resets. Cached QPN entries, such as in static ARP entries or in RARP servers will only work if the implementation(s) using these options ensure that the QPN associated with an interface is invariant across reboots/network resets. So, there are requirements on the IPoIB implementation to make RARP work. Folks in the IPoIB work group decided not to go much further than these statements for RARP support since most folks felt that DHCP is (de facto) replacement. -David >-- greg > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From David.Brean at Sun.COM Sat Mar 5 07:52:16 2005 From: David.Brean at Sun.COM (David M. Brean) Date: Sat, 05 Mar 2005 10:52:16 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <1109989848.20238.16.camel@duffman> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> <1109989848.20238.16.camel@duffman> Message-ID: <4229D5B0.30205@sun.com> Tom Duffy wrote: >On Fri, 2005-03-04 at 16:06 -0500, Talpey, Thomas wrote: > > >>At 02:58 PM 3/4/2005, Tom Duffy wrote: >> >> >>>In any event, I think being able to plop an IB network in an Ethernet >>>world will require things like RARP to work. If there is no spec now, >>>it should be written. >>> >>> >>I can't remember the last time I saw a machine RARP. Well, maybe I do >>but it was like 1980-something. Since DHCP, I don't think there's a >>reason for hosts to do it. >> >> > >I guess Sun is stuck in the 80's -- big hair and new age music. 
All the
>sparc openboot systems rarp/bootp to network start.
>

And DHCP, too.

-David

>-tduffy
From halr at voltaire.com Sat Mar 5 08:13:39 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Mar 2005 11:13:39 -0500
Subject: [openib-general] IB Address Translation service
In-Reply-To: <20050305020402.GA3297@greglaptop.internal.keyresearch.com>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
 <1109715208.11800.41.camel@duffman>
 <1109966313.20238.11.camel@duffman>
 <20050305020402.GA3297@greglaptop.internal.keyresearch.com>
Message-ID: <1110039219.4648.709.camel@localhost.localdomain>

On Fri, 2005-03-04 at 21:04, Greg Lindahl wrote:
> Now it would be nice for ethernet broadcast packets to just work(tm)
> with IPoIB. "ping -b" is an example of a user-level program that
> generates a broadcast packet.

Isn't ping -b a broadcast at the IP (ICMP) level which indirectly causes
a broadcast at the link level? This is different from the arping case,
which directly wants to send (and receive) link-level broadcasts from
user space rather than have the kernel do it on its behalf.

> DHCP clients also generate such packets, and DHCP servers listen for them. Getting a RARP client and
> server to work ought to be the same as a DHCP client and server.

DHCP uses UDP so is similar to ping -b in that regard.
-- Hal From halr at voltaire.com Sat Mar 5 08:17:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Mar 2005 11:17:39 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <4229CEA0.7060904@sun.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> Message-ID: <1110039458.4648.727.camel@localhost.localdomain> On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, > but as someone else pointed out, the IB hardware address contains a > QPN. The I-D for IPoIB says something like: > > The link-layer address for IPoIB includes the QPN which might not be > constant across reboots or even across network interface resets. > Cached QPN entries, such as in static ARP entries or in RARP servers > will only work if the implementation(s) using these options ensure > that the QPN associated with an interface is invariant across > reboots/network resets. That may be the requirement but I think there are some issues with keeping the QPN invariant. Quoting Dror Goldenberg (http://openib.org/pipermail/openib-general/2004-November/006765.html): "Assigning specific QPN for ipoib requires allocation of QPN space which is beyond IB spec verbs. Current verbs do not allow it. I don't have any objection for that, except that you have to hold a set of preallocated QPs with specific numbers and hand them over to privileged consumer when requested to. I wouldn't commit that it will work on any HCA architecture." -- Hal > > So, there are requirements on the IPoIB implementation to make RARP > work. 
Folks in the IPoIB work group decided not to go much further than > these statements for RARP support since most folks felt that DHCP is (de > facto) replacement. > > -David > > > >-- greg > > > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yaronh at voltaire.com Sat Mar 5 10:42:45 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Sat, 5 Mar 2005 20:42:45 +0200 Subject: [openib-general] IB Address Translation service Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAEA7@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Saturday, March 05, 2005 6:18 PM > To: David M. Brean > Cc: openib-general at openib.org > Subject: Re: [openib-general] IB Address Translation service > > On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > > DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, > > but as someone else pointed out, the IB hardware address contains a > > QPN. The I-D for IPoIB says something like: > > > > The link-layer address for IPoIB includes the QPN which might not be > > constant across reboots or even across network interface resets. > > Cached QPN entries, such as in static ARP entries or in RARP servers > > will only work if the implementation(s) using these options ensure > > that the QPN associated with an interface is invariant across > > reboots/network resets. 
>
> That may be the requirement but I think there are some issues with
> keeping the QPN invariant. Quoting Dror Goldenberg
> (http://openib.org/pipermail/openib-general/2004-November/006765.html):
> "Assigning specific QPN for ipoib requires allocation of QPN space which
> is beyond IB spec verbs. Current verbs do not allow it. I don't have any
> objection for that, except that you have to hold a set of preallocated
> QPs with specific numbers and hand them over to privileged consumer when
> requested to. I wouldn't commit that it will work on any HCA
> architecture."
>
> -- Hal
>

Just to add to Hal and Dave: it is not only that the QPN may not be
constant. You can actually have several valid QPNs, one or more per
partition. Since each partition reflects the notion of an IP
VLAN/network, RARP should return a different IP per partition, and the
RARP caller should use a different QPN in each case.

I believe all the emails in this thread clarify why RARP is not a valid
approach.

Yaron
From mst at mellanox.co.il Sun Mar 6 02:38:40 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 6 Mar 2005 12:38:40 +0200
Subject: [PATCH] might_sleep on con_lock (was Re: [openib-general] SDP_CONN_LOCK)
Message-ID: <20050306103840.GR26194@mellanox.co.il>

Quoting r. Libor Michalek :
> Subject: Re: [openib-general] SDP_CONN_LOCK
> > > They do implement exclusive access to the socket, but they implement
> > > exclusive access from both process and irq context, which is why a
> > > semaphore was not used. In interrupt context SDP_CONN_LOCK_BH is used
> > > to lock the connection, look in sdp_cq_event_handler() for it's use,
> > > and in process context SDP_CONN_LOCK is used.
> >
> > I dont really understand how it works.
> > When an interrupt arrives while users != 0, it seems you are
> > calling scheduler().
> > What is sdp_conn_internal_lock doing? I understand it is to be called
> > from interrupt context, but how can it call scheduler() then?
>
> SDP_CONN_LOCK and SDP_CONN_UNLOCK are never called from interrupt
> context, only from process context.

So how about this patch, to make sure we get a stack dump if they are:

Signed-off-by: Michael S. Tsirkin

Index: sdp_conn.h
===================================================================
--- sdp_conn.h (revision 1953)
+++ sdp_conn.h (working copy)
@@ -477,6 +477,8 @@ static inline void sdp_conn_lock(struct
 {
         unsigned long flags;
 
+        might_sleep();
+
         spin_lock_irqsave(&conn->lock.slock, flags);
         if (conn->lock.users != 0) {

--
MST - Michael S. Tsirkin
From mst at mellanox.co.il Sun Mar 6 08:06:11 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 6 Mar 2005 18:06:11 +0200
Subject: [openib-general] rev 1954 - new flint uploaded
Message-ID: <20050306160611.GX26194@mellanox.co.il>

With revision 1954 I have uploaded new flint code,
synched with current mellanox code.

There are lots of changes, since this adds support for
burning more flash types.

Unfortunately I was unable to test this code on
a big endian machine. I hope nothing was broken.
ppc/sparc guys, please test and let me know.

thanks,

--
MST - Michael S. Tsirkin
From mst at mellanox.co.il Sun Mar 6 12:28:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 6 Mar 2005 22:28:45 +0200
Subject: [openib-general] [PATCH] disable MSI for AMD-8131
Message-ID: <20050306202845.GE8486@mellanox.co.il>

Greg, Martin,
The AMD-8131 I/O APIC (device id 1022:7450/7451) does not support
message signalled interrupts. Thus, if a device driver attempts to
enable msi, it will succeed, but interrupts are not actually delivered
to the cpu. The Nforce chipsets do not seem to have this limitation.

AMD confirmed that MSI mode is unsupported with this APIC.

The following patch adds a flag to pci quirks to detect this and disable
msi. Please let me know what you think.

Signed-off-by: Michael S. Tsirkin

diff -rup linux-2.6.11/drivers/pci/msi.c linux-2.6.11-msi/drivers/pci/msi.c
--- linux-2.6.11/drivers/pci/msi.c 2005-03-02 09:38:26.000000000 +0200
+++ linux-2.6.11-msi/drivers/pci/msi.c 2005-05-21 23:29:08.000000000 +0300
@@ -20,6 +20,7 @@
 #include
 #include
 
+#include "pci.h"
 #include "msi.h"
 
 static DEFINE_SPINLOCK(msi_lock);
@@ -372,6 +373,13 @@ static int msi_init(void)
         if (!status)
                 return status;
 
+        if (pci_msi_quirk) {
+                pci_msi_enable = 0;
+                printk(KERN_WARNING "PCI: MSI quirk detected. MSI disabled.\n");
+                status = -EINVAL;
+                return status;
+        }
+
         if ((status = msi_cache_init()) < 0) {
                 pci_msi_enable = 0;
                 printk(KERN_WARNING "PCI: MSI cache init failed\n");
diff -rup linux-2.6.11/drivers/pci/pci.h linux-2.6.11-msi/drivers/pci/pci.h
--- linux-2.6.11/drivers/pci/pci.h 2005-03-02 09:37:55.000000000 +0200
+++ linux-2.6.11-msi/drivers/pci/pci.h 2005-05-21 22:28:21.000000000 +0300
@@ -65,6 +65,7 @@ extern void pci_remove_legacy_files(stru
 extern spinlock_t pci_bus_lock;
 extern int pcie_mch_quirk;
+extern int pci_msi_quirk;
 extern struct device_attribute pci_dev_attrs[];
 extern struct class_device_attribute class_device_attr_cpuaffinity;
diff -rup linux-2.6.11/drivers/pci/quirks.c linux-2.6.11-msi/drivers/pci/quirks.c
--- linux-2.6.11/drivers/pci/quirks.c 2005-03-02 09:37:31.000000000 +0200
+++ linux-2.6.11-msi/drivers/pci/quirks.c 2005-05-21 22:35:45.000000000 +0300
@@ -429,6 +429,8 @@ static void __init quirk_ioapic_rmw(stru
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SI, PCI_ANY_ID, quirk_ioapic_rmw );
 
+int pci_msi_quirk;
+
 #define AMD8131_revA0 0x01
 #define AMD8131_revB0 0x11
 #define AMD8131_MISC 0x40
@@ -437,6 +439,9 @@ static void __init quirk_amd_8131_ioapic
 {
         unsigned char revid, tmp;
 
+        pci_msi_quirk = 1;
+        printk(KERN_WARNING "PCI: MSI quirk detected. pci_msi_quirk set.\n");
+
         if (nr_ioapics == 0)
                 return;
 
From mst at mellanox.co.il Mon Mar 7 07:38:49 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 7 Mar 2005 17:38:49 +0200 Subject: [openib-general] Re: rev 1954 - new flint uploaded In-Reply-To: <20050306160611.GX26194@mellanox.co.il> References: <20050306160611.GX26194@mellanox.co.il> Message-ID: <20050307153849.GI26194@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: rev 1954 - new flint uploaded > > With revision 1954 I have uploaded new flint code, > synched with current mellanox code. I made some last minute fixes and uploaded rev 1957. Enjoy! > There are lots of changes, since this adds support for > burning more flash types. > > Unfortunately I was unable to test this code on > a big endian machine. I hope nothing was broken. > ppc/sparc guys, please test and let me know. -- MST - Michael S. Tsirkin From halr at voltaire.com Mon Mar 7 08:13:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Mar 2005 11:13:45 -0500 Subject: [openib-general] mthca query_device does not fill in struct ib_device_attr Message-ID: <1110212025.4648.38.camel@localhost.localdomain> Hi Roland, It appears that mthca_provider.c::mthca_query_device does not fill in many of the device attributes in struct ib_device_attr. When can the remainder of these be completed ? Thanks. -- Hal From roland at topspin.com Mon Mar 7 08:32:43 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 08:32:43 -0800 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <1110212025.4648.38.camel@localhost.localdomain> (Hal Rosenstock's message of "07 Mar 2005 11:13:45 -0500") References: <1110212025.4648.38.camel@localhost.localdomain> Message-ID: <52fyz7ju1g.fsf@topspin.com> Hal> It appears that mthca_provider.c::mthca_query_device does not Hal> fill in many of the device attributes in struct Hal> ib_device_attr. When can the remainder of these be completed? Any time... which ones are needed now? - R. 
From halr at voltaire.com Mon Mar 7 08:45:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Mar 2005 11:45:01 -0500 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <52fyz7ju1g.fsf@topspin.com> References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> Message-ID: <1110213900.4648.79.camel@localhost.localdomain> On Mon, 2005-03-07 at 11:32, Roland Dreier wrote: > Hal> It appears that mthca_provider.c::mthca_query_device does not > Hal> fill in many of the device attributes in struct > Hal> ib_device_attr. When can the remainder of these be completed? > > Any time... which ones are needed now? Certainly the following ones: u64 max_mr_size; int max_qp; int max_qp_wr; int max_sge; int max_cq; int max_cqe; int max_mr; int max_pd; int max_qp_rd_atom; Thanks. -- Hal From roland at topspin.com Mon Mar 7 08:55:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 08:55:18 -0800 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <1110213900.4648.79.camel@localhost.localdomain> (Hal Rosenstock's message of "07 Mar 2005 11:45:01 -0500") References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> Message-ID: <52acpfjszt.fsf@topspin.com> Hal> Certainly the following ones: OK, it won't be hard to fill out those entries. What application is using this info? - R. 
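
[Editor's illustration, not part of the original thread.] To show why unfilled limits matter to a consumer of the attributes Hal lists, here is a minimal sketch of a caller checking a requested size against a reported limit. It is only a sketch under stated assumptions: struct demo_device_attr is a cut-down stand-in holding just the fields named above, not the kernel's struct ib_device_attr, and demo_clamp_request is a hypothetical helper that does not come from mthca, uDAPL or kDAPL.

```c
#include <assert.h>
#include <stdint.h>

/* Cut-down stand-in for struct ib_device_attr: only the fields named in
 * the message above. This is NOT the kernel definition. */
struct demo_device_attr {
	uint64_t max_mr_size;
	int max_qp;
	int max_qp_wr;
	int max_sge;
	int max_cq;
	int max_cqe;
	int max_mr;
	int max_pd;
	int max_qp_rd_atom;
};

/*
 * Hypothetical helper: clamp a requested value to a reported limit.
 * Returns the usable value, or -1 when the limit was never filled in
 * (reported as zero or negative), so the caller can detect that case
 * instead of silently sizing a resource to zero.
 */
static int demo_clamp_request(int requested, int reported_limit)
{
	assert(requested >= 0);		/* negative requests make no sense */

	if (reported_limit <= 0)
		return -1;		/* limit not reported by the provider */
	if (requested > reported_limit)
		return reported_limit;	/* shrink to what the device allows */
	return requested;
}
```

With a zero-initialized struct demo_device_attr, demo_clamp_request(256, attr.max_qp_wr) returns -1, because the caller cannot tell "not reported" from "unsupported". Once max_qp_wr is filled in as, say, 128, the same call returns 128.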
From halr at voltaire.com Mon Mar 7 08:59:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Mar 2005 11:59:03 -0500 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <52acpfjszt.fsf@topspin.com> References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> <52acpfjszt.fsf@topspin.com> Message-ID: <1110214736.4648.90.camel@localhost.localdomain> On Mon, 2005-03-07 at 11:55, Roland Dreier wrote: > OK, it won't be hard to fill out those entries. What application is > using this info? uDAPL (and kDAPL) use these device attributes currently. -- Hal From xma at us.ibm.com Mon Mar 7 09:38:32 2005 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 7 Mar 2005 09:38:32 -0800 Subject: [openib-general] mthca drvier on PPC platform Message-ID: I am starting to test mthca driver on PPC64. I want to know whether someone has tested mthca on any PPC platform? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Mon Mar 7 09:41:41 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 09:41:41 -0800 Subject: [openib-general] userspace verbs UD support Message-ID: <52wtsjica2.fsf@topspin.com> I've just committed support for UD address handles to libibverbs and libmthca on the roland-uverbs branch. This makes it possible to use UD from userspace. libibverbs/examples has a new ud-pingpong.c demo program that shows how this works. There are no new changes to the kernel uverbs support on the roland-uverbs branch to handle UD, so any modules built with last week's code should still work. - R. 
From roland at topspin.com Mon Mar 7 09:42:33 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 09:42:33 -0800 Subject: [openib-general] mthca drvier on PPC platform In-Reply-To: (Shirley Ma's message of "Mon, 7 Mar 2005 09:38:32 -0800") References: Message-ID: <52sm37ic8m.fsf@topspin.com> Shirley> I am starting to test mthca driver on PPC64. I want to Shirley> know whether someone has tested mthca on any PPC platform? Yes, I have used it successfully on IBM p630 and JS20 systems. - Roland From krause at cup.hp.com Mon Mar 7 09:48:06 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 07 Mar 2005 09:48:06 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <1110039458.4648.727.camel@localhost.localdomain> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> <1110039458.4648.727.camel@localhost.localdomain> Message-ID: <6.2.0.14.2.20050307094517.020916f8@esmail.cup.hp.com> Just to make this clear: - There are only two QP that are defined with specific intention - QP0 and QP1. All other QP may vary throughout the entire QP space. - All ULP built on top of IB must assume that the QP are variant and must discover these through various protocol such as the service ID protocol or for IPoIB, the ARP / ND exchange. - Multiple QP may be used for a given service allowing both finer grain partitioning as well as scaling opportunities. So, this isn't something open to debate. It is how we designed the technology to allow flexibility and performance. Mike At 08:17 AM 3/5/2005, Hal Rosenstock wrote: >On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > > DHCP (and ARP) on IB use it. 
Could make RARP work using this mechanism, > > but as someone else pointed out, the IB hardware address contains a > > QPN. The I-D for IPoIB says something like: > > > > The link-layer address for IPoIB includes the QPN which might not be > > constant across reboots or even across network interface resets. > > Cached QPN entries, such as in static ARP entries or in RARP servers > > will only work if the implementation(s) using these options ensure > > that the QPN associated with an interface is invariant across > > reboots/network resets. > >That may be the requirement but I think there are some issues with >keeping the QPN invariant. Quoting Dror Goldenberg >(http://openib.org/pipermail/openib-general/2004-November/006765.html): >"Assigning specific QPN for ipoib requires allocation of QPN space which >is beyond IB spec verbs. Current verbs do not allow it. I don't have any >objection for that, except that you have to hold a set of preallocated >QPs with specific numbers and hand them over to privileged consumer when >requested to. I wouldn't commit that it will work on any HCA >architecture." > >-- Hal > > > > > > So, there are requirements on the IPoIB implementation to make RARP > > work. Folks in the IPoIB work group decided not to go much further than > > these statements for RARP support since most folks felt that DHCP is (de > > facto) replacement. 
> > > > -David > > > > > > >-- greg > > > > > > > > >_______________________________________________ > > >openib-general mailing list > > >openib-general at openib.org > > >http://openib.org/mailman/listinfo/openib-general > > > > > >To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Mar 7 11:14:07 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 Mar 2005 11:14:07 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <52sm3bl54u.fsf@topspin.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> Message-ID: <422CA7FF.1070506@ichips.intel.com> Roland Dreier wrote: > 1. Don't worry about checking. There's nothing too evil a CM user > can do with a QP beyond getting another QP to connect to it, since > the CM user can't modify a QP unless it legitimately owns it. And > an evil user can always guess the QPN instead of the QP handle anyway. > > 2. Change the CM API so that it just takes the QPN, QP type, SRQ > status and device directly rather than reading it out of the QP. > This lets the userspace CM just get the info from userspace > without needing to look at the QP at all. 
Of course it does raise > the issue of how userspace should specify the device. > > 3. Merge the userspace CM into userspace verbs support so they use > the same context. Ugh. > > Personally I would lean slightly towards #2, since it feels to me like > even the kernel CM API would be cleaner that way. However I don't > have a good answer for how userspace should specify which device to > use. Which of these options is being used? It seems like option #2 would work as long as there's a way to locate a device based on a GID. - Sean From libor at topspin.com Mon Mar 7 12:33:46 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 7 Mar 2005 12:33:46 -0800 Subject: [openib-general] Re: [PATCH][SDP] Make sdp compile on 2.6.11 In-Reply-To: <1109787094.4913.7.camel@duffman>; from tduffy@sun.com on Wed, Mar 02, 2005 at 10:11:34AM -0800 References: <1109787094.4913.7.camel@duffman> Message-ID: <20050307123346.A27729@topspin.com> On Wed, Mar 02, 2005 at 10:11:34AM -0800, Tom Duffy wrote: > Now that 2.6.11 is out, need to make sdp compile with 2.6.11. > Thanks Tom, applied and committed. -Libor From libor at topspin.com Mon Mar 7 13:20:12 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 7 Mar 2005 13:20:12 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <422CA7FF.1070506@ichips.intel.com>; from mshefty@ichips.intel.com on Mon, Mar 07, 2005 at 11:14:07AM -0800 References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> Message-ID: <20050307132012.B27729@topspin.com> On Mon, Mar 07, 2005 at 11:14:07AM -0800, Sean Hefty wrote: > Roland Dreier wrote: > > 1. Don't worry about checking.
There's nothing too evil a CM user > > can do with a QP beyond getting another QP to connect to it, since > > the CM user can't modify a QP unless it legitimately owns it. And > > an evil user can always guess the QPN instead of the QP handle anyway. > > > > 2. Change the CM API so that it just takes the QPN, QP type, SRQ > > status and device directly rather than reading it out of the QP. > > This lets the userspace CM just get the info from userspace > > without needing to look at the QP at all. Of course it does raise > > the issue of how userspace should specify the device. > > > > 3. Merge the userspace CM into userspace verbs support so they use > > the same context. Ugh. > > > > Personally I would lean slightly towards #2, since it feels to me like > > even the kernel CM API would be cleaner that way. However I don't > > have a good answer for how userspace should specify which device to > > use. > > Which of these options is being used? It seems like option #2 would > work as long as there's a way to locate a device based on a GID. Sean, I'm not sure there's an easy way to perform a port source GID to device lookup, did you have something specific in mind for the lookup? Unless there's an easy way to do this, I was going to go ahead with #1... -Libor From libor at topspin.com Mon Mar 7 13:23:02 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 7 Mar 2005 13:23:02 -0800 Subject: [PATCH] might_sleep on con_lock (was Re: [openib-general] SDP_CONN_LOCK) In-Reply-To: <20050306103840.GR26194@mellanox.co.il>; from mst@mellanox.co.il on Sun, Mar 06, 2005 at 12:38:40PM +0200 References: <20050306103840.GR26194@mellanox.co.il> Message-ID: <20050307132302.C27729@topspin.com> On Sun, Mar 06, 2005 at 12:38:40PM +0200, Michael S. Tsirkin wrote: > Quoting r. 
Libor Michalek : > > Subject: Re: [openib-general] SDP_CONN_LOCK > > > > They do implement exclusive access to the socket, but they implement > > > > exclusive access from both process and irq context, which is why a > > > > semaphore was not used. In interrupt context SDP_CONN_LOCK_BH is used > > > > to lock the connection, look in sdp_cq_event_handler() for its use, > > > > and in process context SDP_CONN_LOCK is used. > > > > > > I don't really understand how it works. > > > When an interrupt arrives while users != 0, it seems you are > > > calling scheduler(). > > > What is sdp_conn_internal_lock doing? I understand it is to be called > > > from interrupt context, but how can it call scheduler() then? > > > > SDP_CONN_LOCK and SDP_CONN_UNLOCK are never called from interrupt > > context, only from process context. > > So how about this patch, to make sure we get a stack dump if they are: Seems reasonable, I've applied and committed the patch. Thanks Michael. -Libor From mshefty at ichips.intel.com Mon Mar 7 13:26:19 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 Mar 2005 13:26:19 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <20050307132012.B27729@topspin.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> Message-ID: <422CC6FB.9070206@ichips.intel.com> Libor Michalek wrote: > Sean, I'm not sure there's an easy way to perform a port source GID > to device lookup, did you have something specific in mind for the > lookup? Unless there's an easy way to do this, I was going to go ahead > with #1... There's a device_list maintained in device.c that's used when ib_register_client() is called to report all available devices to a client.
My thinking was to make this list available to cache.c for use calling a function such as ib_get_cached_gid(). - Sean From libor at topspin.com Mon Mar 7 15:11:31 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 7 Mar 2005 15:11:31 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <422CC6FB.9070206@ichips.intel.com>; from mshefty@ichips.intel.com on Mon, Mar 07, 2005 at 01:26:19PM -0800 References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> Message-ID: <20050307151131.D27729@topspin.com> On Mon, Mar 07, 2005 at 01:26:19PM -0800, Sean Hefty wrote: > Libor Michalek wrote: > > Sean, I'm not sure there's an easy way to perform a port source GID > > to device lookup, did you have something specific in mind for the > > lookup? Unless there's an easy way to do this, I was going to go ahead > > with #1... > > There's a device_list maintained in device.c that's used when > ib_register_client() is called to report all available devices to a > client. My thinking was to make this list available to cache.c for use > calling a function such as ib_get_cached_gid(). OK, that does make sense. There's no other reason to stick with solution #1, since getting the device was the only remaining reason to use the QP. So I'm in favor of going with solution #2 and passing in the necessary values directly. This solution also decreases the number of module dependencies, which is always nice. 
-Libor From timur.tabi at ammasso.com Mon Mar 7 15:17:11 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 07 Mar 2005 17:17:11 -0600 Subject: [openib-general] Getting the code and locking user space memory regions Message-ID: <422CE0F7.7060000@ammasso.com> Hi, A long time ago, the openib driver used a hack to call sys_mlock() to lock down a user space memory region. This was because get_user_pages() wasn't completely locking the region like it was supposed to. I haven't paid much attention to the openib stuff since then, but now I want to know what the current development status is. I know that a driver is part of the 2.6.11 kernel, but that driver doesn't have any user-mode support in it. I tried to download the latest code from openib.org, but all I could find was a web interface to "Subversion". Obviously, this is too cumbersome for downloading everything, so is there another way to get all the code? Also, I'd like to know what the current code does to lock memory regions. Does the driver still call sys_mlock()? Has get_user_pages() been fixed (my tests show it hasn't). Is there another technique used? -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From halr at voltaire.com Tue Mar 8 06:10:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Mar 2005 09:10:19 -0500 Subject: [openib-general] Re: Incorrect endian in GUID comparison/SM master selection In-Reply-To: <1109984495.62825.4849.camel@bengbsd.isilon.com> References: <1108607846.27002.113.camel@bengbsd.isilon.com> <1109984495.62825.4849.camel@bengbsd.isilon.com> Message-ID: <1110291019.4650.796.camel@localhost.localdomain> Hi Brian, On Fri, 2005-03-04 at 20:01, Brian Eng wrote: > I found another place in OpenSM where it compares two SM's. I suggest > the following both to fix it and to form a common comparison routine: I had a couple of problems with this patch but worked around them manually. Please review the changes. 
In the future, please make sure the patch is preformatted. Also, not sure why the line numbers were so different. Thanks. Applied. -- Hal From mst at mellanox.co.il Tue Mar 8 07:52:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Mar 2005 17:52:16 +0200 Subject: [openib-general] [PATCH] rq formatting for arbel-native Message-ID: <20050308155216.GG26194@mellanox.co.il> For arbel native, the ee_nds field in the receive queue must include the rq max_gs value (plus header), and not according to the wqe shift value. This differs from current documentation, documentation will be updated. This patch is required to make memfree work on MT25204. Signed-off-by: Michael S. Tsirkin Index: hw/mthca/mthca_qp.c =================================================================== --- hw/mthca/mthca_qp.c (revision 1943) +++ hw/mthca/mthca_qp.c (working copy) @@ -1119,10 +1119,15 @@ static int mthca_alloc_qp_common(struct if (dev->hca_type == ARBEL_NATIVE) { for (i = 0; i < qp->rq.max; ++i) { + int size; wqe = get_recv_wqe(qp, i); wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) << qp->rq.wqe_shift); - wqe->ee_nds = cpu_to_be32(1 << (qp->rq.wqe_shift - 4)); + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + wqe->ee_nds = cpu_to_be32(size / 16); } for (i = 0; i < qp->sq.max; ++i) { -- MST - Michael S. Tsirkin From mst at mellanox.co.il Tue Mar 8 07:58:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Mar 2005 17:58:02 +0200 Subject: [openib-general] [PATCH] register mthca for sinai device id Message-ID: <20050308155802.GH26194@mellanox.co.il> Now that memfree support is merged to trunk, register mthca for MT25204 device ids. Use numeric values to make it work on 2.6.11, until the symbolic names make it upstream. With this and previous patch in place, ip over ib now seems to work for me on MT25204. Signed-off-by: Michael S. 
Tsirkin Index: mthca_main.c =================================================================== --- mthca_main.c (revision 1964) +++ mthca_main.c (working copy) @@ -1094,6 +1094,10 @@ static struct pci_device_id mthca_pci_ta .driver_data = ARBEL_NATIVE }, { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, 0x5e8c), /* Sinai old */ + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, 0x6274), /* Sinai */ + .driver_data = ARBEL_NATIVE }, { 0, } }; -- MST - Michael S. Tsirkin From mshefty at ichips.intel.com Tue Mar 8 10:21:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 10:21:11 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <20050307151131.D27729@topspin.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> <20050307151131.D27729@topspin.com> Message-ID: <422DED17.5070709@ichips.intel.com> Libor Michalek wrote: >>There's a device_list maintained in device.c that's used when >>ib_register_client() is called to report all available devices to a >>client. My thinking was to make this list available to cache.c for use >>calling a function such as ib_get_cached_gid(). > > > OK, that does make sense. There's no other reason to stick with > solution #1, since getting the device was the only remaining reason > to use the QP. So I'm in favor of going with solution #2 and passing > in the necessary values directly. This solution also decreases the > number of module dependencies, which is always nice. Roland, do you have a preference for exposing the device_list and device_sem in device.c? 
I can put ib_get_cached_gid() directly in device.c, but that separates the caching functions. I could also change cache.c to maintain its own device list, which encapsulates the changes more, but duplicates the list. - Sean From mshefty at ichips.intel.com Tue Mar 8 10:51:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 10:51:49 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <422DED17.5070709@ichips.intel.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> <20050307151131.D27729@topspin.com> <422DED17.5070709@ichips.intel.com> Message-ID: <422DF445.1070806@ichips.intel.com> Sean Hefty wrote: > Roland, do you have a preference for exposing the device_list and > device_sem in device.c? I can put ib_get_cached_gid() directly in > device.c, but that separates the caching functions. I could also change > cache.c to maintain its own device list, which encapsulates the changes > more, but duplicates the list. Uhm... thinking about this more, I think that trying to provide this functionality from the cache exposes the potential for a client to access a device after its removal... We can still make the changes to the CM API; it just requires that the CM maintain a list of devices that it can access. - Sean From halr at voltaire.com Tue Mar 8 11:45:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Mar 2005 14:45:37 -0500 Subject: [openib-general] A Couple of CM Questions Message-ID: <1110311137.4645.28.camel@localhost.localdomain> Hi Sean, My main question has to do with an error path in cm_req_handler. 
If cm_init_av fails (lines 1098 or 1103), I get the following crash: Mar 8 14:19:04 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000 Mar 8 14:19:04 localhost kernel: printing eip: Mar 8 14:19:04 localhost kernel: d09db042 Mar 8 14:19:04 localhost kernel: *pde = 0ba1d067 Mar 8 14:19:04 localhost kernel: *pte = 00000000 Mar 8 14:19:04 localhost kernel: Oops: 0000 [#1] Mar 8 14:19:04 localhost kernel: Modules linked in: ib_cm ib_umad ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ib_mthca ib_mad ib_core ohci_hcd eepro100 mii evdev usbcore Mar 8 14:19:04 localhost kernel: CPU: 0 Mar 8 14:19:04 localhost kernel: EIP: 0060:[] Tainted: P VLI Mar 8 14:19:04 localhost kernel: EFLAGS: 00010286 (2.6.10) Mar 8 14:19:04 localhost kernel: EIP is at cm_alloc_msg+0x42/0x100 [ib_cm] Mar 8 14:19:04 localhost kernel: eax: 00000000 ebx: cf641800 ecx: 00000000 edx: cfffa340 Mar 8 14:19:04 localhost kernel: esi: c1ca9400 edi: cf641958 ebp: 00000000 esp: c30d5e38 Mar 8 14:19:04 localhost kernel: ds: 007b es: 007b ss: 0068 Mar 8 14:19:04 localhost kernel: Process ib_cm/0 (pid: 4948, threadinfo=c30d4000 task=c2863aa0) Mar 8 14:19:04 localhost kernel: Stack: cffff560 000000d0 00000028 00000000 c1ca9400 00000000 00000004 d09e0330 Mar 8 14:19:04 localhost kernel: c1ca9400 c30d5e80 33215650 000040a9 0000407e 00000296 ffffffc2 00000282 Mar 8 14:19:04 localhost kernel: 0400407e 000040a9 00000246 00000292 c1ca9400 ffffffea c1c90ea8 d09dc531 Mar 8 14:19:04 localhost kernel: Call Trace: Mar 8 14:19:04 localhost kernel: [] ib_send_cm_rej+0x70/0x2d0 [ib_cm] Mar 8 14:19:04 localhost kernel: [] ib_destroy_cm_id+0x3b1/0x780 [ib_cm] Mar 8 14:19:04 localhost kernel: [] rb_erase+0x4b/0xf0 Mar 8 14:19:04 localhost kernel: [] cm_req_handler+0x16c/0x780 [ib_cm] Mar 8 14:19:04 localhost kernel: [] cm_work_handler+0x0/0x130 [ib_cm] Mar 8 14:19:04 localhost kernel: [] cm_work_handler+0x32/0x130 [ib_cm] Mar 8 14:19:04 localhost kernel: [] 
worker_thread+0x251/0x470 Mar 8 14:19:04 localhost kernel: [] default_wake_function+0x0/0x20 Mar 8 14:19:04 localhost kernel: [] default_wake_function+0x0/0x20 Mar 8 14:19:04 localhost kernel: [] worker_thread+0x0/0x470 Mar 8 14:19:04 localhost kernel: [] kthread+0xaa/0xb0 Mar 8 14:19:04 localhost kernel: [] kthread+0x0/0xb0 Mar 8 14:19:04 localhost kernel: [] kernel_thread_helper+0x5/0x10 Mar 8 14:19:04 localhost kernel: Code: 74 24 20 89 04 24 e8 be 05 77 ef 89 c3 b8 f4 ff ff ff 85 db 0f 84 b5 00 00 00 b9 56 00 00 00 89 df 89 e8 f3 ab 8b 86 8c 00 00 00 <8b> 10 8d 86 a0 00 00 00 89 44 24 04 8b 42 04 8b 40 04 89 04 24 Also, it appears to me that the comm IDs in the CM messages are not endianized on the IB "wire". This causes no issue with interoperability but is slightly less clean to look at. Thanks for your help with this. -- Hal From mshefty at ichips.intel.com Tue Mar 8 12:10:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 12:10:28 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110311137.4645.28.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> Message-ID: <422E06B4.8010608@ichips.intel.com> Hal Rosenstock wrote: > My main question has to do with an error path in cm_req_handler. If > cm_init_av fails (lines 1098 or 1103), I get the following crash: I think I see what's happening here. Since the cm_init_av fails, the cm_id doesn't have the information that it needs in order to send back any sort of reply (including a REJ message) to the sender of the REQ. A quick fix would be to not send the REJ when destroying the cm_id. I'm not sure what the better fix would be at the moment. The CM tries to send replies (including REJ messages) to a received MAD using the path record stored in the REQ. This could be changed to use the path of the received MAD instead. 
But a failure could still occur, so the destruction of the cm_id in the REQ_RCVD state needs some additional error handling. I will queue up trying to get a fix for this after I finish the modifications for the CM to support user-space. Will that work okay? > Also, it appears to me that the comm IDs in the CM messages are not > endianized on the IB "wire". This causes no issue with interoperability > but is slightly less clean to look at. I didn't swap them simply because they didn't need to be. - Sean From roland at topspin.com Tue Mar 8 12:13:03 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 08 Mar 2005 12:13:03 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <422DED17.5070709@ichips.intel.com> (Sean Hefty's message of "Tue, 08 Mar 2005 10:21:11 -0800") References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> <20050307151131.D27729@topspin.com> <422DED17.5070709@ichips.intel.com> Message-ID: <52acpdhp68.fsf@topspin.com> Sean> Roland, do you have a preference for exposing the Sean> device_list and device_sem in device.c? I can put Sean> ib_get_cached_gid() directly in device.c, but that separates Sean> the caching functions. I could also change cache.c to Sean> maintain its own device list, which encapsulates the changes Sean> more, but duplicates the list. I think it makes sense to expose the list and sem to cache.c. Of course we should change the names so that they're less generic (i.e., add an "ib_" prefix) to avoid clashes if someone builds IB support into a monolithic kernel. Also it probably makes sense to turn device_sem into an rwsem and use down_read() in cache.c to allow concurrent lookups by the CM. - R.
From halr at voltaire.com Tue Mar 8 12:11:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Mar 2005 15:11:58 -0500 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <422E06B4.8010608@ichips.intel.com> References: <1110311137.4645.28.camel@localhost.localdomain> <422E06B4.8010608@ichips.intel.com> Message-ID: <1110312718.4648.8.camel@localhost.localdomain> On Tue, 2005-03-08 at 15:10, Sean Hefty wrote: > I will queue up trying to get a fix for this after I finish the > modifications for the CM to support user-space. Will that work okay? Sure. I can get back to making sure this error path works. Thanks. -- Hal From roland at topspin.com Tue Mar 8 12:43:04 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 08 Mar 2005 12:43:04 -0800 Subject: [openib-general] [PATCH] rq formatting for arbel-native In-Reply-To: <20050308155216.GG26194@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 Mar 2005 17:52:16 +0200") References: <20050308155216.GG26194@mellanox.co.il> Message-ID: <526501hns7.fsf@topspin.com> Thanks... as soon as I have some Sinai HCAs to test with I'll roll Sinai support into mthca. - R. From timur.tabi at ammasso.com Tue Mar 8 15:12:27 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Tue, 08 Mar 2005 17:12:27 -0600 Subject: [openib-general] http://openib.org/downloads/ Message-ID: <422E315B.8010708@ammasso.com> Why is this directory empty? I'm trying to download all the openib code (or at least, all the driver code), but I can't find any tarballs. Can anyone tell me where I can download the OpenIB software? 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From libor at topspin.com Tue Mar 8 15:28:08 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 8 Mar 2005 15:28:08 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422E315B.8010708@ammasso.com>; from timur.tabi@ammasso.com on Tue, Mar 08, 2005 at 05:12:27PM -0600 References: <422E315B.8010708@ammasso.com> Message-ID: <20050308152808.B28988@topspin.com> On Tue, Mar 08, 2005 at 05:12:27PM -0600, Timur Tabi wrote: > Why is this directory empty? I'm trying to download all the openib code > (or at least, all the driver code), but I can't find any tarballs. Can > anyone tell me where I can download the OpenIB software? If you want code that is newer than what is in 2.6.11, you'll need to check it out from the subversion repository. For the head-of-tree kernel code, try this: svn co https://openib.org/svn/gen2/trunk/src/linux-kernel -Libor From mlleinin at hpcn.ca.sandia.gov Tue Mar 8 15:31:48 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Tue, 08 Mar 2005 15:31:48 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422E315B.8010708@ammasso.com> References: <422E315B.8010708@ammasso.com> Message-ID: <1110324708.8595.223.camel@localhost> On Tue, 2005-03-08 at 17:12 -0600, Timur Tabi wrote: > Why is this directory empty? I'm trying to download all the openib code > (or at least, all the driver code), but I can't find any tarballs. Can > anyone tell me where I can download the OpenIB software? > You can grab the openib source code from the subversion repository. See http://www.openib.org/tools.html. If you want everything, run 'svn co https://openib.org/svn' Most of the work to date has been for kernel-space IB support (now in the 2.6.11 kernel).
At some point, in the near future, the user-space support will be stable/tested enough that we _may_ start posting tar files, but until then subversion checkout is the best way to get the source. - Matt From mshefty at ichips.intel.com Tue Mar 8 15:59:25 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 8 Mar 2005 15:59:25 -0800 Subject: [openib-general] [PATCH] [CM] Replace QP pointer with necessary values only Message-ID: <20050308155925.43eee30d.mshefty@ichips.intel.com> This patch modifies the CM API to take the necessary QP values, rather than the QP pointer itself. This should simplify the implementation of the usermode CM. I did _not_ change the device caching information for this until I can convince myself that clients should be able to gain access to the device structure in this fashion. As a side effect of this change, SIDR now works in theory. The CM continues to try to send MADs using the same path as that provided in the CM REQ message. Signed-off-by: Sean Hefty Index: infiniband/core/cm.c =================================================================== --- infiniband/core/cm.c (revision 1964) +++ infiniband/core/cm.c (working copy) @@ -61,6 +61,8 @@ static struct ib_client cm_client = { static struct ib_cm { spinlock_t lock; + struct list_head device_list; + rwlock_t device_lock; struct rb_root listen_service_table; /* struct rb_root peer_service_table; todo: fix peer to peer */ struct rb_root remote_qp_table; @@ -71,13 +73,19 @@ static struct ib_cm { } cm; struct cm_port { + struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; - u64 ca_guid; - spinlock_t lock; struct ib_mr *mr; u8 port_num; }; +struct cm_device { + struct list_head list; + struct ib_device *device; + u64 ca_guid; + struct cm_port port[0]; +}; + struct cm_msg { struct cm_id_private *cm_id_priv; struct ib_send_wr send_wr; @@ -214,48 +222,6 @@ static void cm_free_msg(struct cm_msg *m kfree(msg); } -static struct cm_port * cm_find_port(struct ib_device *device, - 
union ib_gid *gid) -{ - struct cm_port *port; - int ret; - u8 p; - - port = (struct cm_port *)ib_get_client_data(device, &cm_client); - if (!port) - return NULL; - - ret = ib_find_cached_gid(device, gid, &p, NULL); - if (ret) - port = NULL; - else - port = &port[p-1]; - - return port; -} - -static int cm_find_device(union ib_gid *gid, struct ib_device **device, - struct cm_port **port) -{ - int ret; - u8 p; - - /* todo: (high priority if SIDR is needed, low otherwise) - write me - need call in ib_cache_* stuff? */ - /* see static device_list in device.c */ - /* ret = ib_find_cached_device_gid(gid, device, &p, NULL); */ - ret = -EINVAL; - if (ret) - return ret; - - *port = (struct cm_port *)ib_get_client_data(*device, &cm_client); - if (!*port) - return -EINVAL; - - *port = &(*port[p-1]); - return 0; -} - static void cm_set_ah_attr(struct ib_ah_attr *ah_attr, u8 port_num, u16 dlid, u8 sl, u16 src_path_bits) { @@ -266,20 +232,33 @@ static void cm_set_ah_attr(struct ib_ah_ ah_attr->port_num = port_num; } -static int cm_init_av(struct ib_device *device, struct ib_sa_path_rec *path, - struct cm_av *av) +static int cm_init_av(struct ib_sa_path_rec *path, struct cm_av *av) { + struct cm_device *cm_dev; + struct cm_port *port = NULL; + unsigned long flags; int ret; + u8 p; + + read_lock_irqsave(&cm.device_lock, flags); + list_for_each_entry(cm_dev, &cm.device_list, list) { + if (!ib_find_cached_gid(cm_dev->device, &path->sgid, + &p, NULL)) { + port = &cm_dev->port[p-1]; + break; + } + } + read_unlock_irqrestore(&cm.device_lock, flags); - av->port = cm_find_port(device, &path->sgid); - if (!av->port) + if (!port) return -EINVAL; - ret = ib_find_cached_pkey(device, av->port->port_num, path->pkey, - &av->pkey_index); + ret = ib_find_cached_pkey(cm_dev->device, port->port_num, + be16_to_cpu(path->pkey), &av->pkey_index); if (ret) return ret; + av->port = port; av->dgid = path->dgid; cm_set_ah_attr(&av->ah_attr, av->port->port_num, path->dlid, path->sl, path->slid & 0x7F); @@ 
-660,8 +639,9 @@ retest: case IB_CM_MRA_REP_RCVD: spin_unlock_irqrestore(&cm_id_priv->lock, flags); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, - &cm_id_priv->av.port->ca_guid, - sizeof &cm_id_priv->av.port->ca_guid, NULL, 0); + &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof &cm_id_priv->av.port->cm_dev->ca_guid, + NULL, 0); break; case IB_CM_ESTABLISHED: spin_unlock_irqrestore(&cm_id_priv->lock, flags); @@ -744,13 +724,13 @@ static void cm_format_req(struct cm_req_ req_msg->local_comm_id = cm_id_priv->id.local_id; req_msg->service_id = param->service_id; - req_msg->local_ca_guid = cm_id_priv->av.port->ca_guid; - cm_req_set_local_qpn(req_msg, cpu_to_be32(param->qp->qp_num)); + req_msg->local_ca_guid = cm_id_priv->av.port->cm_dev->ca_guid; + cm_req_set_local_qpn(req_msg, cpu_to_be32(param->qp_num)); cm_req_set_resp_res(req_msg, param->responder_resources); cm_req_set_init_depth(req_msg, param->initiator_depth); cm_req_set_remote_resp_timeout(req_msg, param->remote_cm_response_timeout); - cm_req_set_qp_type(req_msg, param->qp->qp_type); + cm_req_set_qp_type(req_msg, param->qp_type); cm_req_set_flow_ctrl(req_msg, param->flow_control); cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn)); cm_req_set_local_resp_timeout(req_msg, @@ -760,7 +740,7 @@ static void cm_format_req(struct cm_req_ cm_req_set_path_mtu(req_msg, param->primary_path->mtu); cm_req_set_rnr_retry_count(req_msg, param->rnr_retry_count); cm_req_set_max_cm_retries(req_msg, param->max_cm_retries); - cm_req_set_srq(req_msg, (param->qp->srq != NULL)); + cm_req_set_srq(req_msg, param->srq); req_msg->primary_local_lid = param->primary_path->slid; req_msg->primary_remote_lid = param->primary_path->dlid; @@ -798,10 +778,10 @@ static void cm_format_req(struct cm_req_ static inline int cm_validate_req_param(struct ib_cm_req_param *param) { - if (!param->qp || !param->primary_path) + if (!param->primary_path) return -EINVAL; - if (param->qp->qp_type != IB_QPT_RC && param->qp->qp_type != IB_QPT_UC) + if 
(param->qp_type != IB_QPT_RC && param->qp_type != IB_QPT_UC) return -EINVAL; if (param->private_data && @@ -839,13 +819,11 @@ int ib_send_cm_req(struct ib_cm_id *cm_i } spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = cm_init_av(param->qp->device, param->primary_path, - &cm_id_priv->av); + ret = cm_init_av(param->primary_path, &cm_id_priv->av); if (ret) goto out; if (param->alternate_path) { - ret = cm_init_av(param->qp->device, param->alternate_path, - &cm_id_priv->alt_av); + ret = cm_init_av(param->alternate_path, &cm_id_priv->alt_av); if (ret) goto out; } @@ -1078,13 +1056,11 @@ static int cm_req_handler(struct cm_work cm_id_priv->id.service_mask = ~0ULL; cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); - ret = cm_init_av(work->port->mad_agent->device, &work->path[0], - &cm_id_priv->av); + ret = cm_init_av(&work->path[0], &cm_id_priv->av); if (ret) goto error3; if (req_msg->alt_local_lid) { - ret = cm_init_av(work->port->mad_agent->device, &work->path[1], - &cm_id_priv->alt_av); + ret = cm_init_av(&work->path[1], &cm_id_priv->alt_av); if (ret) goto error3; } @@ -1124,7 +1100,7 @@ static void cm_format_rep(struct cm_rep_ rep_msg->local_comm_id = cm_id_priv->id.local_id; rep_msg->remote_comm_id = cm_id_priv->id.remote_id; - cm_rep_set_local_qpn(rep_msg, cpu_to_be32(param->qp->qp_num)); + cm_rep_set_local_qpn(rep_msg, cpu_to_be32(param->qp_num)); cm_rep_set_starting_psn(rep_msg, cpu_to_be32(param->starting_psn)); rep_msg->resp_resources = param->responder_resources; rep_msg->initiator_depth = param->initiator_depth; @@ -1132,29 +1108,14 @@ static void cm_format_rep(struct cm_rep_ cm_rep_set_failover(rep_msg, param->failover_accepted); cm_rep_set_flow_ctrl(rep_msg, param->flow_control); cm_rep_set_rnr_retry_count(rep_msg, param->rnr_retry_count); - cm_rep_set_srq(rep_msg, (param->qp->srq != NULL)); - rep_msg->local_ca_guid = cm_id_priv->av.port->ca_guid; + cm_rep_set_srq(rep_msg, param->srq); + rep_msg->local_ca_guid = 
cm_id_priv->av.port->cm_dev->ca_guid; if (param->private_data && param->private_data_len) memcpy(rep_msg->private_data, param->private_data, param->private_data_len); } -static inline int cm_validate_rep_param(struct ib_cm_rep_param *param) -{ - if (!param->qp) - return -EINVAL; - - if (param->qp->qp_type != IB_QPT_RC && param->qp->qp_type != IB_QPT_UC) - return -EINVAL; - - if (param->private_data && - param->private_data_len > IB_CM_REP_PRIVATE_DATA_SIZE) - return -EINVAL; - - return 0; -} - int ib_send_cm_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param) { @@ -1165,9 +1126,11 @@ int ib_send_cm_rep(struct ib_cm_id *cm_i unsigned long flags; int ret; - ret = cm_validate_rep_param(param); - if (ret) + if (param->private_data && + param->private_data_len > IB_CM_REP_PRIVATE_DATA_SIZE) { + ret = -EINVAL; goto out; + } cm_id_priv = container_of(cm_id, struct cm_id_private, id); ret = cm_alloc_msg(cm_id_priv, &msg); @@ -2313,7 +2276,6 @@ static void cm_format_sidr_req(struct cm int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param) { - struct ib_device *device; struct cm_id_private *cm_id_priv; struct cm_msg *msg; struct ib_send_wr *bad_send_wr; @@ -2325,11 +2287,7 @@ int ib_send_cm_sidr_req(struct ib_cm_id return -EINVAL; cm_id_priv = container_of(cm_id, struct cm_id_private, id); - ret = cm_find_device(&param->path->sgid, &device, &cm_id_priv->av.port); - if (ret) - goto out; - - ret = cm_init_av(device, param->path, &cm_id_priv->av); + ret = cm_init_av(param->path, &cm_id_priv->av); if (ret) goto out; @@ -2966,7 +2924,8 @@ static u64 cm_get_ca_guid(struct ib_devi static void cm_add_one(struct ib_device *device) { - struct cm_port *port_array, *port; + struct cm_device *cm_dev; + struct cm_port *port; struct ib_mad_reg_req reg_req = { .mgmt_class = IB_MGMT_CLASS_CM, .mgmt_class_version = IB_CM_CLASS_VERSION @@ -2974,24 +2933,25 @@ static void cm_add_one(struct ib_device struct ib_port_modify port_modify = { .set_port_cap_mask =
IB_PORT_CM_SUP }; - u64 ca_guid; - u8 i; + unsigned long flags; int ret; + u8 i; - ca_guid = cm_get_ca_guid(device); - if (!ca_guid) + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * + device->phys_port_cnt, GFP_KERNEL); + if (!cm_dev) return; - port_array = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); - if (!port_array) - return; + cm_dev->device = device; + cm_dev->ca_guid = cm_get_ca_guid(device); + if (!cm_dev->ca_guid) + goto error1; set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); - for (i = 1, port = port_array; i <= device->phys_port_cnt; i++, port++){ - spin_lock_init(&port->lock); - port->ca_guid = ca_guid; + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &cm_dev->port[i-1]; + port->cm_dev = cm_dev; port->port_num = i; - port->mad_agent = ib_register_mad_agent(device, i, IB_QPT_GSI, &reg_req, @@ -3000,54 +2960,64 @@ static void cm_add_one(struct ib_device cm_recv_handler, port); if (IS_ERR(port->mad_agent)) - goto error1; + goto error2; port->mr = ib_get_dma_mr(port->mad_agent->qp->pd, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(port->mr)) - goto error2; + goto error3; ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error3; + goto error4; } - ib_set_client_data(device, &cm_client, port_array); + ib_set_client_data(device, &cm_client, cm_dev); + + write_lock_irqsave(&cm.device_lock, flags); + list_add_tail(&cm_dev->list, &cm.device_list); + write_unlock_irqrestore(&cm.device_lock, flags); return; -error3: +error4: ib_dereg_mr(port->mr); -error2: +error3: ib_unregister_mad_agent(port->mad_agent); -error1: +error2: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; - while (port != port_array) { - --port; + while (--i) { + port = &cm_dev->port[i]; ib_modify_port(device, port->port_num, 0, &port_modify); - ib_dereg_mr(port->mr); ib_unregister_mad_agent(port->mad_agent); } - kfree(port_array); +error1: + kfree(cm_dev); } static void cm_remove_one(struct ib_device *device) { - struct cm_port
*port_array, *port; + struct cm_device *cm_dev; + struct cm_port *port; struct ib_port_modify port_modify = { .clr_port_cap_mask = IB_PORT_CM_SUP }; + unsigned long flags; int i; - port_array = (struct cm_port *)ib_get_client_data(device, &cm_client); - if (!port_array) + cm_dev = ib_get_client_data(device, &cm_client); + if (!cm_dev) return; - for (i = 1, port = port_array; i <= device->phys_port_cnt; i++, port++){ + write_lock_irqsave(&cm.device_lock, flags); + list_del(&cm_dev->list); + write_unlock_irqrestore(&cm.device_lock, flags); + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &cm_dev->port[i]; ib_modify_port(device, port->port_num, 0, &port_modify); - ib_dereg_mr(port->mr); ib_unregister_mad_agent(port->mad_agent); } - kfree(port_array); + kfree(cm_dev); } static int __init ib_cm_init(void) @@ -3055,6 +3025,8 @@ static int __init ib_cm_init(void) int ret; memset(&cm, 0, sizeof cm); + INIT_LIST_HEAD(&cm.device_list); + rwlock_init(&cm.device_lock); spin_lock_init(&cm.lock); cm.listen_service_table = RB_ROOT; cm.remote_id_table = RB_ROOT; From mshefty at ichips.intel.com Tue Mar 8 16:08:48 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 8 Mar 2005 16:08:48 -0800 Subject: [openib-general] [PATCH] [SDP] Updated to new CM API Message-ID: <20050308160848.010c39ff.mshefty@ichips.intel.com> Libor, Here's a patch that updates SDP to the new CM API. I didn't actually test this though. (I did test the CM changes, just not SDP.) Signed-off-by: Sean Hefty Index: infiniband/ulp/sdp/sdp_actv.c =================================================================== --- infiniband/ulp/sdp/sdp_actv.c (revision 1964) +++ infiniband/ulp/sdp/sdp_actv.c (working copy) @@ -472,9 +472,11 @@ static void sdp_cm_path_complete(u64 id, /* * set QP/CM parameters. 
 	 */
-	memset(&param, 0, sizeof(struct ib_cm_req_param));
+	memset(&param, 0, sizeof param);
 
-	param.qp = conn->qp;
+	param.qp_num = conn->qp->qp_num;
+	param.qp_type = conn->qp->qp_type;
+	param.srq = (conn->qp->srq != NULL);
 	param.primary_path = path;
 	param.alternate_path = NULL;
 	param.service_id = cpu_to_be64(SDP_PORT_TO_SID(conn->dst_port));
Index: infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- infiniband/ulp/sdp/sdp_pass.c (revision 1964)
+++ infiniband/ulp/sdp/sdp_pass.c (working copy)
@@ -263,7 +263,8 @@ static int sdp_cm_accept(struct sdp_opt
 	/*
 	 * send REP message to remote CM to continue connection.
 	 */
-	param.qp = conn->qp;
+	param.qp_num = conn->qp->qp_num;
+	param.srq = (conn->qp->srq != NULL);
 	param.starting_psn = conn->rq_psn;
 	param.private_data = hello_ack;
 	/*

From libor at topspin.com Tue Mar 8 16:28:06 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 8 Mar 2005 16:28:06 -0800
Subject: [openib-general] Re: [PATCH] [SDP] Updated to new CM API
In-Reply-To: <20050308160848.010c39ff.mshefty@ichips.intel.com>; from mshefty@ichips.intel.com on Tue, Mar 08, 2005 at 04:08:48PM -0800
References: <20050308160848.010c39ff.mshefty@ichips.intel.com>
Message-ID: <20050308162806.C28988@topspin.com>

On Tue, Mar 08, 2005 at 04:08:48PM -0800, Sean Hefty wrote:
> Libor,
>
> Here's a patch that updates SDP to the new CM API. I didn't actually
> test this though. (I did test the CM changes, just not SDP.)

Thanks. I just tested it and works correctly. Feel free to commit it
at the same time that you commit the CM changes.
-Libor From tduffy at sun.com Tue Mar 8 16:27:55 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 08 Mar 2005 16:27:55 -0800 Subject: [openib-general] [PATCH][libsdp] update to use correct protocol family number (27) Message-ID: <1110328075.20262.69.camel@duffman> Signed-off-by: Tom Duffy Index: gen2/trunk/src/userspace/libsdp/src/sdp_inet.h =================================================================== --- gen2/trunk/src/userspace/libsdp/src/sdp_inet.h (revision 1966) +++ gen2/trunk/src/userspace/libsdp/src/sdp_inet.h (working copy) @@ -27,7 +27,7 @@ /* * constants shared between user and kernel space. */ -#define AF_INET_SDP 26 /* SDP socket protocol family */ +#define AF_INET_SDP 27 /* SDP socket protocol family */ #define AF_INET_STR "AF_INET_SDP" /* SDP enabled environment variable */ #endif /* _TS_SDP_INET_H */ From mshefty at ichips.intel.com Tue Mar 8 16:31:00 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 16:31:00 -0800 Subject: [openib-general] Re: [PATCH] [SDP] Updated to new CM API In-Reply-To: <20050308162806.C28988@topspin.com> References: <20050308160848.010c39ff.mshefty@ichips.intel.com> <20050308162806.C28988@topspin.com> Message-ID: <422E43C4.9050909@ichips.intel.com> Libor Michalek wrote: > On Tue, Mar 08, 2005 at 04:08:48PM -0800, Sean Hefty wrote: > >>Libor, >> >>Here's a patch that updates SDP to the new CM API. I didn't actually >>test this though. (I did test the CM changes, just not SDP.) > > > Thanks. I just tested it and works correctly. Feel free to commit it > at the same time that you commit the CM changes. All CM changes related to this have been committed. 
- Sean

From mshefty at ichips.intel.com Tue Mar 8 17:01:46 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 08 Mar 2005 17:01:46 -0800
Subject: [openib-general] Re: [PATCH] [SDP] Updated to new CM API
In-Reply-To: <20050308162806.C28988@topspin.com>
References: <20050308160848.010c39ff.mshefty@ichips.intel.com> <20050308162806.C28988@topspin.com>
Message-ID: <422E4AFA.1090507@ichips.intel.com>

Libor Michalek wrote:
> On Tue, Mar 08, 2005 at 04:08:48PM -0800, Sean Hefty wrote:
>
>>Libor,
>>
>>Here's a patch that updates SDP to the new CM API. I didn't actually
>>test this though. (I did test the CM changes, just not SDP.)
>
>
> Thanks. I just tested it and works correctly. Feel free to commit it
> at the same time that you commit the CM changes.

FYI - I just found a bug in the CM module unload code. I'll submit a
patch shortly.

- Sean

From iod00d at hp.com Tue Mar 8 18:56:47 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 8 Mar 2005 18:56:47 -0800
Subject: [openib-general] http://openib.org/downloads/
In-Reply-To: <1110324708.8595.223.camel@localhost>
References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost>
Message-ID: <20050309025647.GN5502@esmail.cup.hp.com>

On Tue, Mar 08, 2005 at 03:31:48PM -0800, Matt Leininger wrote:
> You can grab the openib source code from the subversion repository.
> See http://www.openib.org/tools.html. If you want everything run 'svn
> co https://openib.org/svn'

Matt,
probably best to just add a short blurb to tools.html that includes
an example using gen2 branch. That's what we want people to focus on
I think.

grant

From halr at voltaire.com Wed Mar 9 02:06:35 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Mar 2005 05:06:35 -0500
Subject: [openib-general] Kernel oops when unloading ib_cm module with latest CM
Message-ID: <1110362795.4645.16.camel@localhost.localdomain>

This didn't occur before yesterday's CM change.
Mar 9 05:03:34 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000018 Mar 9 05:03:34 localhost kernel: printing eip: Mar 9 05:03:34 localhost kernel: d09472a7 Mar 9 05:03:34 localhost kernel: *pde = 00000000 Mar 9 05:03:34 localhost kernel: Oops: 0000 [#1] Mar 9 05:03:34 localhost kernel: Modules linked in: ib_cm ib_umad ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ib_mthca ib_mad ib_core ohci_hcd eepro100 mii evdev usbcore Mar 9 05:03:34 localhost kernel: CPU: 0 Mar 9 05:03:34 localhost kernel: EIP: 0060:[] Tainted: P VLI Mar 9 05:03:34 localhost kernel: EFLAGS: 00010286 (2.6.10) Mar 9 05:03:34 localhost kernel: EIP is at ib_unregister_mad_agent+0x7/0x30 [ib_mad] Mar 9 05:03:34 localhost kernel: eax: 00000000 ebx: c2eefa24 ecx: c1054660 edx: 00000000 Mar 9 05:03:34 localhost kernel: esi: 00000003 edi: c2eef9e0 ebp: cee7f000 esp: ce783ef0 Mar 9 05:03:34 localhost kernel: ds: 007b es: 007b ss: 0068 Mar 9 05:03:34 localhost kernel: Process modprobe (pid: 5105, threadinfo=ce782000 task=c147f550) Mar 9 05:03:34 localhost kernel: Stack: ce783f08 d09e42cf 00000000 00000001 00000000 ce783f08 00000000 00010000 Mar 9 05:03:34 localhost kernel: 00000000 00000000 d09e6200 cee7f000 08049178 d09e61e4 d08c0cb1 cee7f000 Mar 9 05:03:34 localhost kernel: cedb0160 c012f4b8 c2e55000 00000000 c015892d 00000286 00000000 d09e6200 Mar 9 05:03:34 localhost kernel: Call Trace: Mar 9 05:03:34 localhost kernel: [] cm_remove_one+0x9f/0xd0 [ib_cm] Mar 9 05:03:34 localhost kernel: [] ib_unregister_client+0x211/0x220 [ib_core] Mar 9 05:03:34 localhost kernel: [] destroy_workqueue+0x58/0x1c0 Mar 9 05:03:34 localhost kernel: [] unmap_region+0x9d/0xf0 Mar 9 05:03:34 localhost kernel: [] ib_cm_cleanup+0x26/0x28 [ib_cm] Mar 9 05:03:34 localhost kernel: [] sys_delete_module+0x158/0x190 Mar 9 05:03:34 localhost kernel: [] sys_munmap+0x44/0x70 Mar 9 05:03:34 localhost kernel: [] sysenter_past_esp+0x52/0x75 Mar 9 05:03:34 localhost kernel: Code: 
40 df 94 d0 c7 44 24 08 09 02 00 00 c7 44 24 04 40 de 94 d0 89 44 24 0c e8 b7 0c 7d ef e9 60 fe ff ff 89 f6 83 ec 04 8b 44 24 08 <8b> 48 18 85 c9 74 12 83 e8 08 89 04 24 e8 a7 fb ff ff 5a 31 c0 From halr at voltaire.com Wed Mar 9 02:35:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 05:35:17 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] SDP: Eliminate uneeded initialization and fix some typos Message-ID: <1110364516.4645.22.camel@localhost.localdomain> SDP: Eliminate uneeded initialization and fix some typos Signed-off-by: Hal Rosenstock Index: sdp_link.c =================================================================== --- sdp_link.c (revision 1967) +++ sdp_link.c (working copy) @@ -232,7 +232,7 @@ if (!status) { /* - * on sucess save path record, stop waiting for info, + * on success save path record, stop waiting for info, * and complete all waiting IOs */ info->flags &= ~SDP_LINK_F_PATH; @@ -447,7 +447,6 @@ goto path; } - if ((NUD_CONNECTED|NUD_DELAY|NUD_PROBE) & rt->u.dst.neighbour->nud_state) { memcpy(&info->path.dgid, Index: sdp_actv.c =================================================================== --- sdp_actv.c (revision 1967) +++ sdp_actv.c (working copy) @@ -486,7 +486,6 @@ * no endian swap needed for single byte values. */ param.private_data_len = (u8)(buff->tail - buff->data); - param.peer_to_peer = 0; param.responder_resources = 4; param.initiator_depth = 4; param.remote_cm_response_timeout = 20; Index: sdp_pass.c =================================================================== --- sdp_pass.c (revision 1967) +++ sdp_pass.c (working copy) @@ -143,7 +143,7 @@ return result; } /* - * Functions to handle incomming passive connection requests. (REQ) + * Functions to handle incoming passive connection requests. 
(REQ) */ static int sdp_cm_accept(struct sdp_opt *conn) { From halr at voltaire.com Wed Mar 9 02:42:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 05:42:22 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ Message-ID: <1110364671.4645.27.camel@localhost.localdomain> libsdp: Change TS_ to OPENIB_ Signed-off-by: Hal Rosenstock Index: src/port.c =================================================================== --- src/port.c (revision 1967) +++ src/port.c (working copy) @@ -398,7 +398,7 @@ int protocol ) { -#ifdef _TS_VERBOSE_PRELOAD +#ifdef _OPENIB_VERBOSE_PRELOAD FILE *fd; #endif struct sdp_socket_info *sdp_sock_info; Index: src/socket.c =================================================================== --- src/socket.c (revision 1967) +++ src/socket.c (working copy) @@ -55,7 +55,7 @@ #include #if 0 -#define _TS_VERBOSE_PRELOAD +#define _OPENIB_VERBOSE_PRELOAD #endif #define SOCKOP_socket 1 @@ -98,7 +98,7 @@ char *inet; char **tenviron; -#ifdef _TS_VERBOSE_PRELOAD +#ifdef _OPENIB_VERBOSE_PRELOAD FILE *fd; #endif /* @@ -128,7 +128,7 @@ } /* if */ } /* if */ -#ifdef _TS_VERBOSE_PRELOAD +#ifdef _OPENIB_VERBOSE_PRELOAD fd = fopen("/tmp/libsdp.log.txt", "a+"); fprintf(fd, "SOCKET: <%s> domain <%d> type <%d> protocol <%d>\n", Index: src/sdp_inet.h =================================================================== --- src/sdp_inet.h (revision 1967) +++ src/sdp_inet.h (working copy) @@ -21,8 +21,8 @@ $Id$ */ -#ifndef _TS_SDP_INET_H -#define _TS_SDP_INET_H +#ifndef _SDP_INET_H +#define _SDP_INET_H /* * constants shared between user and kernel space. @@ -30,4 +30,4 @@ #define AF_INET_SDP 26 /* SDP socket protocol family */ #define AF_INET_STR "AF_INET_SDP" /* SDP enabled environment variable */ -#endif /* _TS_SDP_INET_H */ +#endif /* _SDP_INET_H */ From mst at mellanox.co.il Wed Mar 9 04:11:50 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Wed, 9 Mar 2005 14:11:50 +0200
Subject: [openib-general] Re: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_
In-Reply-To: <1110364671.4645.27.camel@localhost.localdomain>
References: <1110364671.4645.27.camel@localhost.localdomain>
Message-ID: <20050309121150.GC1826@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_
>
> libsdp: Change TS_ to OPENIB_
>
> Signed-off-by: Hal Rosenstock

Thanks!

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Wed Mar 9 04:12:08 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 Mar 2005 14:12:08 +0200
Subject: [openib-general] Re: [PATCH][libsdp] update to use correct protocol family number (27)
In-Reply-To: <1110328075.20262.69.camel@duffman>
References: <1110328075.20262.69.camel@duffman>
Message-ID: <20050309121208.GD1826@mellanox.co.il>

Quoting r. Tom Duffy :
> Subject: [PATCH][libsdp] update to use correct protocol family number (27)
>
> Signed-off-by: Tom Duffy

Thanks!

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Wed Mar 9 04:27:00 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 Mar 2005 14:27:00 +0200
Subject: [openib-general] [PATCH] uverbs rdma example
Message-ID: <20050309122700.GA2352@mellanox.co.il>

Here is a small test for the rdma functionality. I based it on the
pingpong test, the main change being polling on data instead of
receive completions. This is useful as an example of using rdma, and
is also useful as a post send latency benchmark, for tuning (nicer
than the send test in that it lets us measure post send separately
from poll cq).

Code is originally based on the pingpong test. I intentionally did not
rename functions from pingpong_ to rdma_ to make it easier to share
some code later if we decide it is useful.

Roland, I also noticed a race in the pingpong test: you exchange
connection data over socket when the qp is still in INIT.
Then, the client immediately may move it to RTR, to RTS and start
posting work requests. If the client is fast enough, send may arrive
at the server when the server qp is still in INIT, and an error will
be generated.

For now I gave up on measuring time with rdtsc for benchmarking, since
the results seem quite close to what I get with simple gettimeofday,
and the latter is more portable. I have the relevant code available if
someone wants it.

I fixed the race in this new test by calling the exch routine a second
time (and I had to split this routine, to avoid closing the socket).
I guess the pingpong test must be fixed, too.

Signed-off-by: Michael S. Tsirkin

Index: Makefile.am
===================================================================
--- Makefile.am (revision 1967)
+++ Makefile.am (working copy)
@@ -20,7 +20,8 @@ src_libibverbs_la_LDFLAGS = -version-inf
 src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map
 
 bin_PROGRAMS = examples/ibv_devices examples/ibv_asyncwatch \
-	examples/ibv_pingpong examples/ibv_ud_pingpong
+	examples/ibv_pingpong examples/ibv_ud_pingpong \
+	examples/ibv_rdma
 examples_ibv_devices_SOURCES = examples/device_list.c
 examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la
 examples_ibv_pingpong_SOURCES = examples/pingpong.c
@@ -29,6 +30,8 @@ examples_ibv_ud_pingpong_SOURCES = examp
 examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la
 examples_ibv_asyncwatch_SOURCES = examples/asyncwatch.c
 examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_rdma_SOURCES = examples/rdma.c
+examples_ibv_rdma_LDADD = $(top_builddir)/src/libibverbs.la
 
 libibverbsincludedir = $(includedir)/infiniband
 
Index: examples/rdma.c
===================================================================
--- examples/rdma.c (revision 0)
+++ examples/rdma.c (revision 0)
@@ -0,0 +1,686 @@
+/*
+ * Copyright (c) 2005 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Mellanox Technologies Ltd.
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +enum { + PINGPONG_RDMA_WRID = 3, +}; + +static int page_size; + +struct pingpong_context { + struct ibv_context *context; + struct ibv_pd *pd; + struct ibv_mr *mr; + struct ibv_cq *cq; + struct ibv_qp *qp; + void *buf; + volatile char *post_buf; + volatile char *poll_buf; + int size; + int rx_depth; + int tx_depth; +}; + +struct pingpong_dest { + int lid; + int qpn; + int psn; + unsigned rkey; + unsigned long long vaddr; +}; + +/* + * pp_get_local_lid() uses a pretty bogus method for finding the LID + * of a local port. Please don't copy this into your app (or if you + * do, please rip it out soon). + */ +static uint16_t pp_get_local_lid(struct ibv_device *dev, int port) +{ + char path[256]; + char val[16]; + char *name; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return 0; + } + + asprintf(&name, "%s/class/infiniband/%s/ports/%d/lid", path, + ibv_get_device_name(dev), port); + + if (sysfs_read_attribute_value(name, val, sizeof val)) { + fprintf(stderr, "Couldn't read LID at %s\n", name); + return 0; + } + + return strtol(val, NULL, 0); +} + +static int pp_client_connect(const char *servername, int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int n; + int sockfd = -1; + + asprintf(&service, "%d", port); + n = getaddrinfo(servername, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + 
freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); + return sockfd; + } + return sockfd; +} + +struct pingpong_dest * pp_client_exch_dest(int sockfd, + const struct pingpong_dest *my_dest) +{ + struct pingpong_dest *rem_dest = NULL; + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn,my_dest->rkey,my_dest->vaddr); + if (write(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client write"); + fprintf(stderr, "Couldn't send local address\n"); + goto out; + } + + if (read(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client read"); + fprintf(stderr, "Couldn't read remote address\n"); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if (!rem_dest) + goto out; + + sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); + +out: + return rem_dest; +} + +int pp_server_connect(int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_flags = AI_PASSIVE, + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int sockfd = -1, connfd; + int n; + + asprintf(&service, "%d", port); + n = getaddrinfo(NULL, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + n = 1; + + setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + + if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't listen to port %d\n", port); + return sockfd; + } + + listen(sockfd, 1); + connfd = accept(sockfd, NULL, 0); + if (connfd < 0) { + perror("server accept"); + fprintf(stderr, "accept() failed\n"); + close(sockfd); + return 
connfd; + } + + close(sockfd); + return connfd; +} + +static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) +{ + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + struct pingpong_dest *rem_dest = NULL; + int parsed; + int n; + + n = read(connfd, msg, sizeof msg); + if (n != sizeof msg) { + perror("server read"); + fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if (!rem_dest) + goto out; + + parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); + if (parsed != 5) { + fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, + msg); + free(rem_dest); + rem_dest = NULL; + goto out; + } + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn,my_dest->rkey,my_dest->vaddr); + if (write(connfd, msg, sizeof msg) != sizeof msg) { + perror("server write"); + fprintf(stderr, "Couldn't send local address\n"); + free(rem_dest); + rem_dest = NULL; + goto out; + } +out: + return rem_dest; +} + +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, + int tx_depth, int rx_depth, int port) +{ + struct pingpong_context *ctx; + + ctx = malloc(sizeof *ctx); + if (!ctx) + return NULL; + + ctx->size = size; + ctx->rx_depth = rx_depth; + ctx->tx_depth = tx_depth; + + ctx->buf = memalign(page_size, size * 2); + if (!ctx->buf) { + fprintf(stderr, "Couldn't allocate work buf.\n"); + return NULL; + } + + memset(ctx->buf, 0, size * 2); + + ctx->post_buf = (char*)ctx->buf + (size - 1); + ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); + + ctx->context = ibv_open_device(ib_dev); + if (!ctx->context) { + fprintf(stderr, "Couldn't get context for %s\n", + ibv_get_device_name(ib_dev)); + return NULL; + } + + ctx->pd = ibv_alloc_pd(ctx->context); + if (!ctx->pd) { + fprintf(stderr, "Couldn't allocate PD\n"); + return
NULL; + } + + ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, + IBV_ACCESS_REMOTE_WRITE); + if (!ctx->mr) { + fprintf(stderr, "Couldn't allocate MR\n"); + return NULL; + } + + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); + if (!ctx->cq) { + fprintf(stderr, "Couldn't create CQ\n"); + return NULL; + } + + { + struct ibv_qp_init_attr attr = { + .send_cq = ctx->cq, + .recv_cq = ctx->cq, + .cap = { + .max_send_wr = tx_depth, + .max_recv_wr = rx_depth, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .qp_type = IBV_QPT_RC + }; + + ctx->qp = ibv_create_qp(ctx->pd, &attr); + if (!ctx->qp) { + fprintf(stderr, "Couldn't create QP\n"); + return NULL; + } + } + + { + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_INIT; + attr.pkey_index = 0; + attr.port_num = port; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS)) { + fprintf(stderr, "Failed to modify QP to INIT\n"); + return NULL; + } + } + + return ctx; +} + +static int pp_post_rdma(struct pingpong_context *ctx, + struct pingpong_dest* rem_dest) +{ + struct ibv_sge list = { + .addr = (uintptr_t) ctx->buf, + .length = ctx->size, + .lkey = ctx->mr->lkey + }; + struct ibv_send_wr wr = { + .wr_id = PINGPONG_RDMA_WRID, + .sg_list = &list, + .num_sge = 1, + .opcode = IBV_WR_RDMA_WRITE, + .send_flags = IBV_SEND_SIGNALED, + .wr.rdma.remote_addr = rem_dest->vaddr, + .wr.rdma.rkey = rem_dest->rkey + }; + struct ibv_send_wr *bad_wr; + + return ibv_post_send(ctx->qp, &wr, &bad_wr); +} + +static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, + struct pingpong_dest *dest) +{ + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_RTR; + attr.path_mtu = IBV_MTU_1024; + attr.dest_qp_num = dest->qpn; + attr.rq_psn = dest->psn; + attr.max_dest_rd_atomic = 1; + attr.min_rnr_timer = 12; + attr.ah_attr.is_global = 0; + attr.ah_attr.dlid = dest->lid; + attr.ah_attr.sl = 0; + 
attr.ah_attr.src_path_bits = 0; + attr.ah_attr.port_num = port; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER)) { + fprintf(stderr, "Failed to modify QP to RTR\n"); + return 1; + } + + attr.qp_state = IBV_QPS_RTS; + attr.timeout = 14; + attr.retry_cnt = 7; + attr.rnr_retry = 7; + attr.sq_psn = my_psn; + attr.max_rd_atomic = 1; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC)) { + fprintf(stderr, "Failed to modify QP to RTS\n"); + return 1; + } + + return 0; +} + +static void usage(const char *argv0) +{ + printf("Usage:\n"); + printf(" %s start a server and wait for connection\n", argv0); + printf(" %s connect to server at \n", argv0); + printf("\n"); + printf("Options:\n"); + printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -d, --ib-dev= use IB device (default first device found)\n"); + printf(" -i, --ib-port= use port of IB device (default 1)\n"); + printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -t, --tx-depth= size of tx queue (default 50)\n"); + printf(" -n, --iters= number of exchanges (default 1000)\n"); +} + +int main(int argc, char *argv[]) +{ + struct dlist *dev_list; + struct ibv_device *ib_dev; + struct pingpong_context *ctx; + struct pingpong_dest my_dest; + struct pingpong_dest *rem_dest; + struct timeval start, end; + char *ib_devname = NULL; + char *servername = NULL; + int port = 18515; + int ib_port = 1; + int size = 1; + int rx_depth = 1; + int tx_depth = 50; + int iters = 1000; + int scnt, rcnt, ccnt; + int client_first_post; + int sockfd; + + srand48(getpid() * time(NULL)); + + while (1) { + int c; + + static struct option long_options[] = { + { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "ib-dev", .has_arg = 1, .val = 'd' }, + { .name 
= "ib-port", .has_arg = 1, .val = 'i' }, + { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "tx-depth",.has_arg = 1, .val = 't' }, + { 0 } + }; + + c = getopt_long(argc, argv, "p:d:i:s:t:n:e", long_options, NULL); + if (c == -1) + break; + + switch (c) { + case 'p': + port = strtol(optarg, NULL, 0); + if (port < 0 || port > 65535) { + usage(argv[0]); + return 1; + } + break; + + case 'd': + ib_devname = strdupa(optarg); + break; + + case 'i': + ib_port = strtol(optarg, NULL, 0); + if (ib_port < 0) { + usage(argv[0]); + return 1; + } + break; + + case 's': + size = strtol(optarg, NULL, 0); + break; + + case 't': + tx_depth = strtol(optarg, NULL, 0); + break; + + case 'n': + iters = strtol(optarg, NULL, 0); + break; + + default: + usage(argv[0]); + return 1; + } + } + + if (optind == argc - 1) + servername = strdupa(argv[optind]); + else if (optind < argc) { + usage(argv[0]); + return 1; + } + + page_size = sysconf(_SC_PAGESIZE); + + dev_list = ibv_get_devices(); + + dlist_start(dev_list); + if (!ib_devname) { + ib_dev = dlist_next(dev_list); + if (!ib_dev) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } + } else { + dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + break; + if (!ib_dev) { + fprintf(stderr, "IB device %s not found\n", ib_devname); + return 1; + } + } + + ctx = pp_init_ctx(ib_dev, size, iters, rx_depth, ib_port); + if (!ctx) + return 1; + + my_dest.lid = pp_get_local_lid(ib_dev, ib_port); + my_dest.qpn = ctx->qp->qp_num; + my_dest.psn = lrand48() & 0xffffff; + if (!my_dest.lid) { + fprintf(stderr, "Local lid 0x0 detected. 
Is an SM running?\n"); + return 1; + } + my_dest.rkey = ctx->mr->rkey; + my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + + printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " + "RKey %#08x VAddr %#016Lx\n", + my_dest.lid, my_dest.qpn, my_dest.psn, + my_dest.rkey, my_dest.vaddr); + + + if (servername) { + sockfd = pp_client_connect(servername, port); + } else { + sockfd = pp_server_connect(port); + } + if (sockfd < 0) + return 1; + + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + if (!rem_dest) + return 1; + + printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " + "RKey %#08x VAddr %#016Lx\n", + rem_dest->lid, rem_dest->qpn, rem_dest->psn, + rem_dest->rkey, rem_dest->vaddr); + + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) + return 1; + + /* An additional handshake is required *after* moving qp to RTR. + Arbitrarily reuse exch_dest for this purpose. */ + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + write(sockfd, "done", sizeof "done"); + close(sockfd); + + if (gettimeofday(&start, NULL)) { + perror("gettimeofday"); + return 1; + } + + scnt = 0; + rcnt = 0; + ccnt = 0; + if (servername) + client_first_post = 1; + else + client_first_post = 0; + + while (scnt < iters || ccnt < iters || rcnt < iters) { + + /* Wait till buffer changes. */ + if (rcnt < iters && ! client_first_post) { + ++rcnt; + while (*ctx->poll_buf != (char)rcnt) { + } + /* Here the data is already in the physical memory. + If we wanted to actually use it, we may need + a read memory barrier here. 
*/ + } else + client_first_post = 0; + + if (scnt < iters) { + *ctx->post_buf = (char)++scnt; + if (pp_post_rdma(ctx, rem_dest)) { + fprintf(stderr, "Couldn't post send: scnt=%d\n", + scnt); + return 1; + } + } + + if (ccnt < iters) { + struct ibv_wc wc; + int ne; + ++ccnt; + do { + ne = ibv_poll_cq(ctx->cq, 1, &wc); + } while (ne == 0); + + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } + if (wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Completion with error at %s:\n", + servername?"client":"server"); + fprintf(stderr, "Failed status %d: wr_id %d\n", + wc.status, (int) wc.wr_id); + fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n", + scnt, rcnt, ccnt); + return 1; + } + } + } + + if (gettimeofday(&end, NULL)) { + perror("gettimeofday"); + return 1; + } + + { + float usec = (end.tv_sec - start.tv_sec) * 1000000 + + (end.tv_usec - start.tv_usec); + long long bytes = (long long) size * iters; + + printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n", + bytes, usec / 1000000., bytes * 8. / usec); + printf("%d iters in %.2f seconds = %.2f usec/iter\n", + iters, usec / 1000000., usec / iters); + } + + return 0; +} -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Mar 9 04:34:36 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Mar 2005 14:34:36 +0200 Subject: [openib-general] Re: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ In-Reply-To: <1110364671.4645.27.camel@localhost.localdomain> References: <1110364671.4645.27.camel@localhost.localdomain> Message-ID: <20050309123436.GA2586@mellanox.co.il> Quoting r. 
Hal Rosenstock : > Subject: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ > > libsdp: Change TS_ to OPENIB_ > > Signed-off-by: Hal Rosenstock > > Index: src/port.c > =================================================================== > --- src/port.c (revision 1967) > +++ src/port.c (working copy) > @@ -398,7 +398,7 @@ > int protocol > ) > { > -#ifdef _TS_VERBOSE_PRELOAD > +#ifdef _OPENIB_VERBOSE_PRELOAD > FILE *fd; > #endif I decided to rename this one to _SDP_VERBOSE_PRELOAD. -- MST - Michael S. Tsirkin From halr at voltaire.com Wed Mar 9 06:43:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 09:43:18 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] SDP: sdp_actv.c remove redundant initialization Message-ID: <1110379398.4645.46.camel@localhost.localdomain> SDP: sdp_actv.c remove redundant initialization qp_attr->min_rnr_timer is already initialized to 0 by cm_init_qp_rtr_attr in cm.c Is this really intended to be IB_RNR_TIMER_122_88 instead ? Signed-off-by: Hal Rosenstock Index: sdp_actv.c =================================================================== --- sdp_actv.c (revision 1970) +++ sdp_actv.c (working copy) @@ -133,10 +133,9 @@ goto done; } - qp_attr->min_rnr_timer = 0; /* IB_RNR_TIMER_122_88; */ qp_attr->rq_psn = conn->rq_psn; - attr_mask |= (IB_QP_MIN_RNR_TIMER | IB_QP_RQ_PSN); + attr_mask |= IB_QP_RQ_PSN; result = ib_modify_qp(conn->qp, qp_attr, attr_mask); if (result) { From halr at voltaire.com Wed Mar 9 07:25:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 10:25:57 -0500 Subject: [openib-general] PATCH] [TRIVIAL] SDP: sdp_pass.c remove redundant initialization Message-ID: <1110381957.4645.1.camel@localhost.localdomain> SDP: sdp_pass.c remove redundant initialization (Similar to previous sdp_actv.c patch) qp_attr->min_rnr_timer is already initialized to 0 by cm_init_qp_rtr_attr in cm.c Is this really intended to be IB_RNR_TIMER_122_88 instead ? 
Signed-off-by: Hal Rosenstock Index: sdp_pass.c =================================================================== --- sdp_pass.c (revision 1970) +++ sdp_pass.c (working copy) @@ -235,10 +235,9 @@ goto error; } - qp_attr->min_rnr_timer = 0; /* IB_RNR_TIMER_122_88; */ qp_attr->rq_psn = conn->rq_psn; - qp_mask |= (IB_QP_MIN_RNR_TIMER | IB_QP_RQ_PSN); + qp_mask |= IB_QP_RQ_PSN; result = ib_modify_qp(conn->qp, qp_attr, qp_mask); kfree(qp_attr); From halr at voltaire.com Wed Mar 9 08:00:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 11:00:42 -0500 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <52acpfjszt.fsf@topspin.com> References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> <52acpfjszt.fsf@topspin.com> Message-ID: <1110383801.4648.1.camel@localhost.localdomain> On Mon, 2005-03-07 at 11:55, Roland Dreier wrote: > OK, it won't be hard to fill out those entries. Any idea on when this change will be made ? Thanks. -- Hal From timur.tabi at ammasso.com Wed Mar 9 08:58:47 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 10:58:47 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <1110324708.8595.223.camel@localhost> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> Message-ID: <422F2B47.80809@ammasso.com> Matt Leininger wrote: > Most of the work to date has been for kernel-space IB support (now in > the 2.6.11 kernel). At some point, in the near future, the user-space > support will be stable/tested enough that we _may_ start posting tar > files, but until then subversion checkout is the best way to get the > source. Just to be clear - the current user-space stuff, whatever it is, is in the subversion repository? 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From halr at voltaire.com Wed Mar 9 09:12:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 12:12:53 -0500 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F2B47.80809@ammasso.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> Message-ID: <1110388190.4647.3.camel@localhost.localdomain> On Wed, 2005-03-09 at 11:58, Timur Tabi wrote: > Just to be clear - the current user-space stuff, whatever it is, is in > the subversion repository? The latest user space verbs is on the roland-uverbs branch in the repository (https://openib.org/svn/gen2/branches/roland-uverbs/). It will be merged back to the mainline (https://openib.org/svn/gen2/trunk/src/userspace/) but an earlier version is there presently. -- Hal From roland at topspin.com Wed Mar 9 09:28:17 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 09:28:17 -0800 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <1110383801.4648.1.camel@localhost.localdomain> (Hal Rosenstock's message of "09 Mar 2005 11:00:42 -0500") References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> <52acpfjszt.fsf@topspin.com> <1110383801.4648.1.camel@localhost.localdomain> Message-ID: <527jkgenke.fsf@topspin.com> Hal> Any idea on when this change will be made ? I should be able to get to it before the end of this week. Keep in mind that this won't help uDAPL, since filling out this function in the kernel does nothing to get the information to userspace. If this is blocking you, you should be able to fill in reasonable defaults to make progress. I can't imagine an application depends on knowing the exact values of these limits. - R. 
From mshefty at ichips.intel.com Wed Mar 9 09:38:48 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 09:38:48 -0800 Subject: [openib-general] Re: Kernel oops when unloading ib_cm module with latest CM In-Reply-To: <1110362795.4645.16.camel@localhost.localdomain> References: <1110362795.4645.16.camel@localhost.localdomain> Message-ID: <422F34A8.2080200@ichips.intel.com> Hal Rosenstock wrote: > This didn't occur before yesterday's CM change. Yeah, I saw this right before I left work yesterday. I'll have a fix in a couple of hours. - Sean From timur.tabi at ammasso.com Wed Mar 9 09:51:51 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 11:51:51 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <1110388190.4647.3.camel@localhost.localdomain> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> Message-ID: <422F37B7.7060609@ammasso.com> Hal Rosenstock wrote: > The latest user space verbs is on the roland-uverbs branch in the > repository (https://openib.org/svn/gen2/branches/roland-uverbs/). > > It will be merged back to the mainline > (https://openib.org/svn/gen2/trunk/src/userspace/) but an earlier > version is there presently. I see that function ibv_lock_range() in libibverbs calls the mlock() system call. mlock() can only be called by a process that has root privileges. Does this mean the user-space verbs support is only available to applications that run as root? 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From roland at topspin.com Wed Mar 9 09:56:25 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 09:56:25 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F37B7.7060609@ammasso.com> (Timur Tabi's message of "Wed, 09 Mar 2005 11:51:51 -0600") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> Message-ID: <52y8cwd7p2.fsf@topspin.com> Timur> I see that function ibv_lock_range() in libibverbs calls Timur> the mlock() system call. mlock() can only be called by a Timur> process that has root privileges. Does this mean the Timur> user-space verbs support is only available to applications Timur> that run as root? Actually this isn't true. Any process can call mlock() (try it and see). - R. From timur.tabi at ammasso.com Wed Mar 9 09:56:23 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 11:56:23 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <52y8cwd7p2.fsf@topspin.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> Message-ID: <422F38C7.805@ammasso.com> Roland Dreier wrote: > Timur> I see that function ibv_lock_range() in libibverbs calls > Timur> the mlock() system call. mlock() can only be called by a > Timur> process that has root privileges. Does this mean the > Timur> user-space verbs support is only available to applications > Timur> that run as root? > > Actually this isn't true. Any process can call mlock() (try it and see). Since when? man mlock: ERRORS EPERM The calling process does not have appropriate privileges. Only root processes are allowed to lock pages. 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From roland at topspin.com Wed Mar 9 10:01:15 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 10:01:15 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F38C7.805@ammasso.com> (Timur Tabi's message of "Wed, 09 Mar 2005 11:56:23 -0600") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> Message-ID: <52u0nkd7h0.fsf@topspin.com> Timur> Since when? According to my kernel tree, a change "rlimit-based mlocks for unprivileged users" was committed around August of last year. Timur> man mlock: I guess the man page is out of date. As I said, try it and see. - R. From timur.tabi at ammasso.com Wed Mar 9 10:00:56 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 12:00:56 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <52u0nkd7h0.fsf@topspin.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <52u0nkd7h0.fsf@topspin.com> Message-ID: <422F39D8.8050302@ammasso.com> Roland Dreier wrote: > Timur> Since when? > > According to my kernel tree, a change "rlimit-based mlocks for > unprivileged users" was committed around August of last year. Can you tell me which kernel version in particular? 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From bill at strahm.net Wed Mar 9 10:11:17 2005 From: bill at strahm.net (Bill Strahm) Date: Wed, 09 Mar 2005 10:11:17 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F38C7.805@ammasso.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> Message-ID: <422F3C45.4010003@strahm.net> Timur Tabi wrote: > Roland Dreier wrote: > >> Timur> I see that function ibv_lock_range() in libibverbs calls >> Timur> the mlock() system call. mlock() can only be called by a >> Timur> process that has root privileges. Does this mean the >> Timur> user-space verbs support is only available to applications >> Timur> that run as root? >> >> Actually this isn't true. Any process can call mlock() (try it and >> see). > > > Since when? > > man mlock: > > ERRORS > > EPERM The calling process does not have appropriate > privileges. Only root processes are allowed to lock pages. > Well, you can CALL it, it just returns an expected error code. Bill From mshefty at ichips.intel.com Wed Mar 9 10:20:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 9 Mar 2005 10:20:11 -0800 Subject: [openib-general] [PATCH] [CM] fix unload issue, crash rejecting a REQ after an error Message-ID: <20050309102011.1183256b.mshefty@ichips.intel.com> This patch fixes the CM unload issue added by the previous patch. It should also allow sending a REJ message in response to a REQ after an error has occurred. 
Signed-off-by: Sean Hefty Index: infiniband/core/cm.c =================================================================== --- infiniband/core/cm.c (revision 1965) +++ infiniband/core/cm.c (working copy) @@ -232,7 +232,16 @@ static void cm_set_ah_attr(struct ib_ah_ ah_attr->port_num = port_num; } -static int cm_init_av(struct ib_sa_path_rec *path, struct cm_av *av) +static void cm_init_av_for_response(struct cm_port *port, + struct ib_wc *wc, struct cm_av *av) +{ + av->port = port; + av->pkey_index = wc->pkey_index; + cm_set_ah_attr(&av->ah_attr, port->port_num, wc->slid, wc->sl, + wc->dlid_path_bits); +} + +static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) { struct cm_device *cm_dev; struct cm_port *port = NULL; @@ -259,7 +268,6 @@ static int cm_init_av(struct ib_sa_path_ return ret; av->port = port; - av->dgid = path->dgid; cm_set_ah_attr(&av->ah_attr, av->port->port_num, path->dlid, path->sl, path->slid & 0x7F); return 0; @@ -819,11 +827,12 @@ int ib_send_cm_req(struct ib_cm_id *cm_i } spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = cm_init_av(param->primary_path, &cm_id_priv->av); + ret = cm_init_av_by_path(param->primary_path, &cm_id_priv->av); if (ret) goto out; if (param->alternate_path) { - ret = cm_init_av(param->alternate_path, &cm_id_priv->alt_av); + ret = cm_init_av_by_path(param->alternate_path, + &cm_id_priv->alt_av); if (ret) goto out; } @@ -1012,6 +1021,8 @@ static int cm_req_handler(struct cm_work cm_id_priv = container_of(cm_id, struct cm_id_private, id); cm_id_priv->id.remote_id = req_msg->local_comm_id; + cm_init_av_for_response(work->port, work->mad_recv_wc->wc, + &cm_id_priv->av); cm_id_priv->timewait_info = cm_create_timewait_info( cm_id_priv->id.local_id, cm_id_priv->id.remote_id, @@ -1056,11 +1067,11 @@ static int cm_req_handler(struct cm_work cm_id_priv->id.service_mask = ~0ULL; cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); - ret = cm_init_av(&work->path[0], &cm_id_priv->av); + ret = 
cm_init_av_by_path(&work->path[0], &cm_id_priv->av); if (ret) goto error3; if (req_msg->alt_local_lid) { - ret = cm_init_av(&work->path[1], &cm_id_priv->alt_av); + ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av); if (ret) goto error3; } @@ -2287,7 +2298,7 @@ int ib_send_cm_sidr_req(struct ib_cm_id return -EINVAL; cm_id_priv = container_of(cm_id, struct cm_id_private, id); - ret = cm_init_av(param->path, &cm_id_priv->av); + ret = cm_init_av_by_path(param->path, &cm_id_priv->av); if (ret) goto out; @@ -2359,6 +2370,8 @@ static int cm_sidr_req_handler(struct cm wc = work->mad_recv_wc->wc; cm_id_priv->av.dgid.global.subnet_prefix = wc->slid; cm_id_priv->av.dgid.global.interface_id = 0; + cm_init_av_for_response(work->port, work->mad_recv_wc->wc, + &cm_id_priv->av); cm_id_priv->id.remote_id = sidr_req_msg->request_id; cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; atomic_inc(&cm_id_priv->work_count); @@ -2383,10 +2396,6 @@ static int cm_sidr_req_handler(struct cm cm_id_priv->id.context = cur_cm_id_priv->id.context; cm_id_priv->id.service_id = sidr_req_msg->service_id; cm_id_priv->id.service_mask = ~0ULL; - cm_id_priv->av.port = work->port; - cm_id_priv->av.pkey_index = wc->pkey_index; - cm_set_ah_attr(&cm_id_priv->av.ah_attr, work->port->port_num, - wc->slid, wc->sl, wc->dlid_path_bits); cm_format_sidr_req_event(work, &cur_cm_id_priv->id); cm_process_work(cm_id_priv, work); @@ -3013,7 +3022,8 @@ static void cm_remove_one(struct ib_devi write_unlock_irqrestore(&cm.device_lock, flags); for (i = 1; i <= device->phys_port_cnt; i++) { - port = &cm_dev->port[i]; + port = &cm_dev->port[i-1]; + ib_dereg_mr(port->mr); ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } From mshefty at ichips.intel.com Wed Mar 9 10:25:26 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 10:25:26 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: 
<1110311137.4645.28.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> Message-ID: <422F3F96.4050803@ichips.intel.com> Hal Rosenstock wrote: > Hi Sean, > > My main question has to do with an error path in cm_req_handler. If > cm_init_av fails (lines 1098 or 1103), I get the following crash: I don't have an easy way to test this, but the patch that I just submitted for the CM should correct this problem. Please let me know if you run into any more problems, or if this patch doesn't work. - Sean From iod00d at hp.com Wed Mar 9 10:30:55 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 9 Mar 2005 10:30:55 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F38C7.805@ammasso.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> Message-ID: <20050309183055.GH10338@esmail.cup.hp.com> On Wed, Mar 09, 2005 at 11:56:23AM -0600, Timur Tabi wrote: > man mlock: > > ERRORS > > EPERM The calling process does not have appropriate > privileges. Only root processes are allowed to lock pages. mine says: EPERM The calling process has insufficient privilege to call mlock. Under Linux the CAP_IPC_LOCK capability is required. Assuming roland tried it, I'll guess that all processes are given CAP_IPC_LOCK by default. 
Further, ulimit -a output is enlightening: grundler at gsyprf11:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited grant From halr at voltaire.com Wed Mar 9 11:20:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 14:20:18 -0500 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <422F3F96.4050803@ichips.intel.com> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> Message-ID: <1110396018.4645.33.camel@localhost.localdomain> On Wed, 2005-03-09 at 13:25, Sean Hefty wrote: > > My main question has to do with an error path in cm_req_handler. If > > cm_init_av fails (lines 1098 or 1103), I get the following crash: > > I don't have an easy way to test this, but the patch that I just > submitted for the CM should correct this problem. Please let me know > if you run into any more problems, or if this patch doesn't work. CM unload problem appears to be fixed. Also, this fixes the crash when this occurs but the removal of the CM module now hangs. Any easy way to reproduce this is to clear out the path record DGID before sending REP. Thanks. 
-- Hal From mshefty at ichips.intel.com Wed Mar 9 11:31:03 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 11:31:03 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110396018.4645.33.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> <1110396018.4645.33.camel@localhost.localdomain> Message-ID: <422F4EF7.9070500@ichips.intel.com> Hal Rosenstock wrote: > Also, this fixes the crash when this occurs but the removal of the CM > module now hangs. > > Any easy way to reproduce this is to clear out the path record DGID > before sending REP. Thanks for the info. I'll try to isolate what's causing the hang. - Sean From roland at topspin.com Wed Mar 9 11:37:45 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 11:37:45 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F39D8.8050302@ammasso.com> (Timur Tabi's message of "Wed, 09 Mar 2005 12:00:56 -0600") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <52u0nkd7h0.fsf@topspin.com> <422F39D8.8050302@ammasso.com> Message-ID: <52ll8wd306.fsf@topspin.com> Timur> Can you tell me which kernel version in particular? Sorry, I don't have that info handy, but if you have a bk tree it should be easy to figure out. Or searching the web for the string "rlimit-based mlocks for unprivileged users" should work. - R. 
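The mlock() question in this thread can be checked directly rather than argued from man pages. What follows is a minimal sketch, assuming a Linux system with glibc; the helper names memlock_soft_limit() and try_mlock() are made up for illustration and are not part of libibverbs or of any patch posted here. It reads the soft RLIMIT_MEMLOCK limit (the resource limit that the "rlimit-based mlocks for unprivileged users" change refers to) and then attempts to lock a single buffer, returning whatever errno mlock() reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: fetch the soft RLIMIT_MEMLOCK limit in bytes.
 * Returns 0 on success, or the errno value set by getrlimit(). */
int memlock_soft_limit(rlim_t *soft)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl))
		return errno;
	*soft = rl.rlim_cur;
	return 0;
}

/* Hypothetical helper: allocate len bytes on a page boundary, try to
 * lock them, then unlock and free.  Returns 0 if mlock() succeeded,
 * otherwise the errno value it reported (for example EPERM or ENOMEM). */
int try_mlock(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	int err = 0;

	assert(page > 0);
	if (posix_memalign(&buf, (size_t) page, len))
		return ENOMEM;
	if (mlock(buf, len))
		err = errno;
	else
		munlock(buf, len);
	free(buf);
	return err;
}
```

Calling try_mlock() with one page from a non-root shell, and comparing the result with the value from memlock_soft_limit() (or with "ulimit -l"), should show which behaviour a given kernel has. The outcome depends on the kernel version and the configured limit, so no particular result is claimed here.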
From roland at topspin.com Wed Mar 9 11:41:46 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 11:41:46 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <20050309183055.GH10338@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 9 Mar 2005 10:30:55 -0800") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <20050309183055.GH10338@esmail.cup.hp.com> Message-ID: <52fyz4d2th.fsf@topspin.com> Actually, my mlock(2) man page says: EPERM (Linux 2.6.9 and later) the caller was not privileged (CAP_IPC_LOCK) and its RLIMIT_MEMLOCK soft resource limit was 0. EPERM (Linux 2.6.8 and earlier) The calling process has insufficient privilege to call munlockall. Under Linux the CAP_IPC_LOCK capability is required. so I guess the change was made in version 2.6.9. - R. From mshefty at ichips.intel.com Wed Mar 9 13:49:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 13:49:49 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110396018.4645.33.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> <1110396018.4645.33.camel@localhost.localdomain> Message-ID: <422F6F7D.4040000@ichips.intel.com> Hal Rosenstock wrote: >>>My main question has to do with an error path in cm_req_handler. If >>>cm_init_av fails (lines 1098 or 1103), I get the following crash: >> > Also, this fixes the crash when this occurs but the removal of the CM > module now hangs. Hal, how many CPUs is your system running on? 
- Sean From tduffy at sun.com Wed Mar 9 15:22:31 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 09 Mar 2005 15:22:31 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F3C45.4010003@strahm.net> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <422F3C45.4010003@strahm.net> Message-ID: <1110410551.7648.8.camel@duffman> On Wed, 2005-03-09 at 10:11 -0800, Bill Strahm wrote: > Well, you can CALL it, it just returns an expected error code. Bill took a big dose of LFP's for breakfast today as required by the IETF ;) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Wed Mar 9 16:37:44 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 16:37:44 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110396018.4645.33.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> <1110396018.4645.33.camel@localhost.localdomain> Message-ID: <422F96D8.3070207@ichips.intel.com> Hal Rosenstock wrote: >>>My main question has to do with an error path in cm_req_handler. If >>>cm_init_av fails (lines 1098 or 1103), I get the following crash: >> > Also, this fixes the crash when this occurs but the removal of the CM > module now hangs. > > Any easy way to reproduce this is to clear out the path record DGID > before sending REP. an update... I've been able to reproduce this, and what's happening is that the cm_id that the CM created to handle the REQ is hanging waiting for its reference count to go to 0, but I'm not entirely sure why yet. 
The REQ is received and processed in a CM controlled work queue. After seeing the error, the CM sends a REJ message to the sender. (The code to set the proper reject code is not there yet, but a REJ should still be delivered.) As a result of sending the REJ, the reference count on the cm_id is incremented. The CM then waits in the CM work queue thread for the send to complete, which would decrement the reference count. The send completion should be processed from the context of the MAD layer controlled work queue, so I'm not sure why it's not getting called. My planned long term fix is to allow the REJ to be sent without holding a reference on the cm_id. But there's a similar issue sending a DREQ or DREP when destroying a cm_id. So, I'm trying to understand this more. - Sean From mplee at hpcn.ca.sandia.gov Thu Mar 10 00:05:54 2005 From: mplee at hpcn.ca.sandia.gov (Michael Lee) Date: Thu, 10 Mar 2005 00:05:54 -0800 Subject: [openib-general] openib.org services unavailable on 3/10/2005 Message-ID: <1110441954.9018.9.camel@acheron.ca.sandia.gov> Due to a last-minute planned power outage in our computer lab, the openib.org server will be unavailable from 7:00am to 7:30am PST on Thursday, 3/10/2005. -- Michael Lee HPCN Sandia From mst at mellanox.co.il Thu Mar 10 02:42:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 12:42:11 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <52acpotwd9.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> Message-ID: <20050310104211.GF2586@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > Michael> Interestingly, if I rebuild libmthca with -O3 compiler > Michael> flag, the pingpong program does not make progress. > Michael> Building libibverbs or the test itself with -O3 has no > Michael> such effect. 
> > I can't reproduce this on either i386 or x86-64 (Intel Nocona system). > > $ gcc --version > gcc (GCC) 3.3.5 (Debian 1:3.3.5-8) > Copyright (C) 2003 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > > - R. > I think I have discovered the problem. It seems that with -O3 my compiler may reorder the WQE (and possibly CQE) write with respect to the doorbell. This won't happen on i386 with consistent i/o ordering since the doorbell is done in assembly, and probably not on other 32 bit architectures since the mutex is likely to include a memory barrier. Applying the following patch fixes the problem for me for x86_64. Signed-off-by: Michael S. Tsirkin Index: src/userspace/libmthca/src/doorbell.h =================================================================== --- src/userspace/libmthca/src/doorbell.h (revision 1972) +++ src/userspace/libmthca/src/doorbell.h (working copy) @@ -56,6 +56,9 @@ static inline void mthca_write64(uint32_ static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset) { + /* Sufficient for x86_64. + * Other architectures may need a memory barrier here. */ + asm volatile("" ::: "memory"); *(volatile uint64_t *) (ctx->uar + offset) = *(uint64_t *) val; } -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 10 02:47:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 12:47:22 +0200 Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ? Message-ID: <20050310104722.GG2586@mellanox.co.il> I am able to shave about 200ns off the rdma post latency, by using pthread_spinlock instead of pthread_mutex for protecting the qp post op in libmthca. I'm aware of course that a context switch when spinlock is held may waste a whole timeslice, but maybe for short operations such as this it's reasonable to use spinlocks? -- MST - Michael S. 
Tsirkin

From mst at mellanox.co.il Thu Mar 10 04:19:41 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 Mar 2005 14:19:41 +0200
Subject: [openib-general] [PATCH] AIO code to use get_user_pages
Message-ID: <20050310121941.GI2586@mellanox.co.il>

Well, I went ahead and modified the AIO code to use get_user_pages.
Since we don't yet have fmr support, this patch is untested, but it
does compile :) Please let me know what you think.

Another approach (instead of waiting for fmr support) could be
to add a fall-back option to use a regular memory region.
A todo item is to add zcopy support for synchronous operations.

Signed-off-by: Michael S. Tsirkin

Index: sdp_send.c
===================================================================
--- sdp_send.c	(revision 1972)
+++ sdp_send.c	(working copy)
@@ -2195,6 +2195,7 @@ skip: /* entry point for IOCB based tran
 	iocb->req = req;
 	iocb->key = req->ki_key;
 	iocb->addr = (unsigned long)msg->msg_iov->iov_base;
+	iocb->is_receive = 0;

 	req->ki_cancel = sdp_inet_write_cancel;

Index: sdp_recv.c
===================================================================
--- sdp_recv.c	(revision 1972)
+++ sdp_recv.c	(working copy)
@@ -1459,6 +1459,7 @@ int sdp_inet_recv(struct kiocb *req, st
 	iocb->req = req;
 	iocb->key = req->ki_key;
 	iocb->addr = (unsigned long)msg->msg_iov->iov_base;
+	iocb->is_receive = 1;

 	req->ki_cancel = sdp_inet_read_cancel;

Index: sdp_iocb.c
===================================================================
--- sdp_iocb.c	(revision 1972)
+++ sdp_iocb.c	(working copy)
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2005 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.
You may choose to be licensed under the terms of the GNU @@ -31,89 +32,107 @@ * * $Id$ */ - +#include #include "sdp_main.h" static kmem_cache_t *sdp_iocb_cache = NULL; -/* - * memory locking functions - */ -#include - -typedef int (*do_mlock_ptr_t)(unsigned long, size_t, int); -static do_mlock_ptr_t mlock_ptr = NULL; +static void sdp_copy_one_page(struct page *from, struct page* to, + unsigned long iocb_addr, size_t iocb_size, + unsigned long uaddr) +{ + size_t size_left = iocb_addr + iocb_size - uaddr; + size_t size = min(size_left,PAGE_SIZE); + unsigned long offset = uaddr % PAGE_SIZE; + void* fptr; + void* tptr; + + fptr = kmap_atomic(from, KM_USER0); + tptr = kmap_atomic(to, KM_USER0); + + memcpy(tptr + offset, fptr + offset, size); + + kunmap_atomic(tptr, KM_USER0); + kunmap_atomic(fptr, KM_USER0); + set_page_dirty_lock(to); +} /* - * do_iocb_unlock - unlock the memory for an IOCB + * sdp_iocb_unlock - unlock the memory for an IOCB + * Copy if pages moved since. */ -static int do_iocb_unlock(struct sdpc_iocb *iocb) +int sdp_iocb_unlock(struct sdpc_iocb *iocb) { - struct vm_area_struct *vma; + int result = 0; + struct page ** pages = NULL; + unsigned long uaddr; + int i; - vma = find_vma(iocb->mm, (iocb->addr & PAGE_MASK)); - if (!vma) - sdp_warn("No VMA for IOCB <%lx:%Zu> unlock", - iocb->addr, iocb->size); + if (!(iocb->flags & SDP_IOCB_F_LOCKED)) + return 0; - while (vma) { - sdp_dbg_data(NULL, - "unmark <%lx> <%p> <%08lx:%08lx> <%08lx> <%ld>", - iocb->addr, vma, vma->vm_start, vma->vm_end, - vma->vm_flags, (long)vma->vm_private_data); - - spin_lock(&iocb->mm->page_table_lock); - /* - * if there are no more references to the vma - */ - vma->vm_private_data--; - - if (!vma->vm_private_data) { - /* - * modify VM flags. 
- */ - vma->vm_flags &= ~(VM_DONTCOPY|VM_LOCKED); - /* - * adjust locked page count - */ - vma->vm_mm->locked_vm -= ((vma->vm_end - - vma->vm_start) >> - PAGE_SHIFT); - } + /* For read, unlock and we are done */ + if (!iocb->is_receive) { + for (i = 0;i < iocb->page_count; ++i) + page_cache_release(iocb->page_array[i]); + goto done; + } - spin_unlock(&iocb->mm->page_table_lock); - /* - * continue if the buffer continues onto the next vma - */ - if ((iocb->addr + iocb->size) > vma->vm_end) - vma = vma->vm_next; - else - vma = NULL; + /* For write, we must check the virtual pages did not get remapped */ + + /* As an optimisation (to avoid scanning the vma tree each time), + * try to get all pages in one go. */ + /* TODO: use cache for allocations? Allocate by chunks? */ + + pages = kmalloc((sizeof(struct page *) * + iocb->page_count), GFP_KERNEL); + + down_read(&iocb->mm->mmap_sem); + + if (pages) { + result=get_user_pages(iocb->tsk, iocb->mm, + iocb->addr, + iocb->page_count , iocb->is_receive, 0, + pages, NULL); + + if (result != iocb->page_count) { + kfree(pages); + pages = NULL; + } } - return 0; -} + for (i = 0, uaddr = iocb->addr; i < iocb->page_count; + ++i, uaddr = (uaddr & PAGE_MASK) + PAGE_SIZE) + { + struct page* page; + set_page_dirty_lock(iocb->page_array[i]); + + if (pages) + page = pages[i]; + else { + result=get_user_pages(iocb->tsk, iocb->mm, + uaddr & PAGE_MASK, + 1 , 1, 0, &page, NULL); + if (result != 1) { + page = NULL; + } + } -/* - * sdp_iocb_unlock - unlock the memory for an IOCB - */ -int sdp_iocb_unlock(struct sdpc_iocb *iocb) -{ - int result; + if (page && iocb->page_array[i] != page) + sdp_copy_one_page(iocb->page_array[i], page, + iocb->addr, iocb->size, uaddr); - /* - * check if IOCB is locked. - */ - if (!(iocb->flags & SDP_IOCB_F_LOCKED)) - return 0; - /* - * spin lock since this could be from interrupt context. 
- */ - down_write(&iocb->mm->mmap_sem); - - result = do_iocb_unlock(iocb); + if (page) + page_cache_release(page); + page_cache_release(iocb->page_array[i]); + } + + up_read(&iocb->mm->mmap_sem); - up_write(&iocb->mm->mmap_sem); + if (pages) + kfree(pages); + +done: kfree(iocb->page_array); kfree(iocb->addr_array); @@ -121,37 +140,41 @@ int sdp_iocb_unlock(struct sdpc_iocb *io iocb->page_array = NULL; iocb->addr_array = NULL; iocb->mm = NULL; - /* - * mark IOCB unlocked. - */ + iocb->tsk = NULL; + iocb->flags &= ~SDP_IOCB_F_LOCKED; return result; } /* - * sdp_iocb_page_save - save page information for an IOCB + * sdp_iocb_lock - lock the memory for an IOCB + * We do not take a reference on the mm, AIO handles this for us. */ -static int sdp_iocb_page_save(struct sdpc_iocb *iocb) +int sdp_iocb_lock(struct sdpc_iocb *iocb) { - unsigned int counter; + int result = -ENOMEM; unsigned long addr; size_t size; - int result = -ENOMEM; - struct page *page; - unsigned long pfn; - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; + int i; + /* + * iocb->addr - buffer start address + * iocb->size - buffer length + * addr - page aligned + * size - page multiple + */ + addr = iocb->addr & PAGE_MASK; + size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - if (iocb->page_count <= 0 || iocb->size <= 0 || !iocb->addr) - return -EINVAL; + iocb->page_offset = iocb->addr - addr; + + iocb->page_count = size >> PAGE_SHIFT; /* * create array to hold page value which are later needed to register * the buffer with the HCA */ + + /* TODO: use cache for allocations? Allocate by chunks? 
*/ iocb->addr_array = kmalloc((sizeof(u64) * iocb->page_count), GFP_KERNEL); if (!iocb->addr_array) @@ -161,259 +184,41 @@ static int sdp_iocb_page_save(struct sdp GFP_KERNEL); if (!iocb->page_array) goto err_page; - /* - * iocb->addr - buffer start address - * iocb->size - buffer length - * addr - page aligned - * size - page multiple - */ - addr = iocb->addr & PAGE_MASK; - size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - iocb->page_offset = iocb->addr - addr; - /* - * Find pages used within the buffer which will then be registered - * for RDMA - */ - spin_lock(&iocb->mm->page_table_lock); + down_write(&current->mm->mmap_sem); - for (counter = 0; - size > 0; - counter++, addr += PAGE_SIZE, size -= PAGE_SIZE) { - pgd = pgd_offset_gate(iocb->mm, addr); - if (!pgd || pgd_none(*pgd)) - break; - - pud = pud_offset(pgd, addr); - if (!pud || pud_none(*pud)) - break; - - pmd = pmd_offset(pud, addr); - if (!pmd || pmd_none(*pmd)) - break; - - ptep = pte_offset_map(pmd, addr); - if (!ptep) - break; - - pte = *ptep; - pte_unmap(ptep); - - if (!pte_present(pte)) - break; - - pfn = pte_pfn(pte); - if (!pfn_valid(pfn)) - break; - - page = pfn_to_page(pfn); - - iocb->page_array[counter] = page; - iocb->addr_array[counter] = page_to_phys(page); + result=get_user_pages(current, current->mm, iocb->addr, + iocb->page_count , iocb->is_receive, 0, + iocb->page_array, NULL); + + up_read(&current->mm->mmap_sem); + + if (result != iocb->page_count) { + sdp_dbg_err("unable to lock <%lx:%Zu> error <%d> <%d>", + iocb->addr, iocb->size, result, iocb->page_count); + goto err_get; } - spin_unlock(&iocb->mm->page_table_lock); - - if (size > 0) { - result = -EFAULT; - goto err_find; - } - - return 0; -err_find: - - kfree(iocb->page_array); - iocb->page_array = NULL; -err_page: - - kfree(iocb->addr_array); - iocb->addr_array = NULL; -err_addr: - - return result; -} - -/* - * sdp_iocb_lock - lock the memory for an IOCB - */ -int sdp_iocb_lock(struct sdpc_iocb *iocb) -{ - struct vm_area_struct
*vma; - kernel_cap_t real_cap; - unsigned long limit; - int result = -ENOMEM; - unsigned long addr; - size_t size; - - /* - * mark IOCB as locked. We do not take a reference on the mm, AIO - * handles this for us. - */ iocb->flags |= SDP_IOCB_F_LOCKED; iocb->mm = current->mm; - /* - * save and raise capabilities - */ - real_cap = cap_t(current->cap_effective); - cap_raise(current->cap_effective, CAP_IPC_LOCK); - - size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - addr = iocb->addr & PAGE_MASK; - - iocb->page_count = size >> PAGE_SHIFT; + iocb->tsk = current; - limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur; - limit >>= PAGE_SHIFT; - /* - * lock the mm, if within the limit lock the address range. - */ - down_write(&iocb->mm->mmap_sem); - if (!((iocb->page_count + current->mm->locked_vm) > limit)) - result = (*mlock_ptr)(addr, size, 1); - /* - * process result - */ - if (result) { - sdp_dbg_err("VMA lock <%lx:%Zu> error <%d> <%d:%lu:%lu>", - iocb->addr, iocb->size, result, - iocb->page_count, iocb->mm->locked_vm, limit); - goto err_lock; + for (i = 0; i< iocb->page_count; ++i) { + iocb->addr_array[i] = page_to_phys(iocb->page_array[i]); } - /* - * look up the head of the vma queue, loop through the vmas, marking - * them do not copy, reference counting, and saving them. - */ - vma = find_vma(iocb->mm, addr); - if (!vma) - /* - * sanity check. - */ - sdp_warn("No VMA for IOCB! <%lx:%Zu> lock", - iocb->addr, iocb->size); - - while (vma) { - spin_lock(&iocb->mm->page_table_lock); - - if (!(VM_LOCKED & vma->vm_flags)) - sdp_warn("Unlocked vma! <%08lx>", vma->vm_flags); - - if (PAGE_SIZE < (unsigned long)vma->vm_private_data) - sdp_dbg_err("VMA: private daya in use! 
<%08lx>", - (unsigned long)vma->vm_private_data); - - vma->vm_flags |= VM_DONTCOPY; - vma->vm_private_data++; - - spin_unlock(&iocb->mm->page_table_lock); - - sdp_dbg_data(NULL, - "mark <%lx> <0x%p> <%08lx:%08lx> <%08lx> <%ld>", - iocb->addr, vma, vma->vm_start, vma->vm_end, - vma->vm_flags, (long)vma->vm_private_data); - - if ((addr + size) > vma->vm_end) - vma = vma->vm_next; - else - vma = NULL; - } - - result = sdp_iocb_page_save(iocb); - if (result) { - sdp_dbg_err("Error <%d> saving pages for IOCB <%lx:%Zu>", - result, iocb->addr, iocb->size); - goto err_save; - } - - up_write(&iocb->mm->mmap_sem); - cap_t(current->cap_effective) = real_cap; return 0; -err_save: - - (void)do_iocb_unlock(iocb); -err_lock: - /* - * unlock the mm and restore capabilities. - */ - up_write(&iocb->mm->mmap_sem); - cap_t(current->cap_effective) = real_cap; - - iocb->flags &= ~SDP_IOCB_F_LOCKED; - iocb->mm = NULL; +err_get: + kfree(iocb->page_array); +err_page: + kfree(iocb->addr_array); +err_addr: return result; } /* - * IOCB memory locking init functions - */ -struct kallsym_iter { - loff_t pos; - struct module *owner; - unsigned long value; - unsigned int nameoff; /* If iterating in core kernel symbols */ - char type; - char name[128]; -}; - -/* - * sdp_mem_lock_init - initialize the userspace memory locking - */ -static int sdp_mem_lock_init(void) -{ - struct file *kallsyms; - struct seq_file *seq; - struct kallsym_iter *iter; - loff_t pos = 0; - int ret = -EINVAL; - - sdp_dbg_init("Memory Locking initialization."); - - kallsyms = filp_open("/proc/kallsyms", O_RDONLY, 0); - if (!kallsyms) { - sdp_warn("Failed to open /proc/kallsyms"); - goto done; - } - - seq = (struct seq_file *)kallsyms->private_data; - if (!seq) { - sdp_warn("Failed to fetch sequential file."); - goto err_close; - } - - for (iter = seq->op->start(seq, &pos); - iter != NULL; - iter = seq->op->next(seq, iter, &pos)) - if (!strcmp(iter->name, "do_mlock")) - mlock_ptr = (do_mlock_ptr_t)iter->value; - - if 
(!mlock_ptr) - sdp_warn("Failed to find lock pointer."); - else - ret = 0; - -err_close: - filp_close(kallsyms, NULL); -done: - return ret; -} - -/* - * sdp_mem_lock_cleanup - cleanup the memory locking tables - */ -static int sdp_mem_lock_cleanup(void) -{ - sdp_dbg_init("Memory Locking cleanup."); - /* - * null out entries. - */ - mlock_ptr = NULL; - - return 0; -} - -/* * IOCB memory registration functions */ @@ -831,28 +636,12 @@ void sdp_iocb_q_clear(struct sdpc_iocb_q } /* - * primary initialization/cleanup functions - */ - -/* * sdp_main_iocb_init - initialize the advertisment caches */ int sdp_main_iocb_init(void) { - int result; - sdp_dbg_init("IOCB cache initialization."); - /* - * initialize locking code. - */ - result = sdp_mem_lock_init(); - if (result < 0) { - sdp_warn("Error <%d> initializing memory locking.", result); - return result; - } - /* - * initialize the caches only once. - */ + if (sdp_iocb_cache) { sdp_warn("IOCB caches already initialized."); return -EINVAL; @@ -862,15 +651,10 @@ int sdp_main_iocb_init(void) sizeof(struct sdpc_iocb), 0, SLAB_HWCACHE_ALIGN, NULL, NULL); - if (!sdp_iocb_cache) { - result = -ENOMEM; - goto error_iocb_c; - } + if (!sdp_iocb_cache) + return -ENOMEM; return 0; -error_iocb_c: - (void)sdp_mem_lock_cleanup(); - return result; } /* @@ -879,16 +663,6 @@ error_iocb_c: void sdp_main_iocb_cleanup(void) { sdp_dbg_init("IOCB cache cleanup."); - /* - * cleanup the caches - */ kmem_cache_destroy(sdp_iocb_cache); - /* - * null out entries. - */ sdp_iocb_cache = NULL; - /* - * cleanup memory locking - */ - (void)sdp_mem_lock_cleanup(); } Index: sdp_iocb.h =================================================================== --- sdp_iocb.h (revision 1972) +++ sdp_iocb.h (working copy) @@ -99,9 +99,11 @@ struct sdpc_iocb { /* * page list. 
data for locking/registering userspace
 	 */
-	struct mm_struct *mm;      /* user mm struct */
-	unsigned long addr;        /* user space address */
-	size_t size;               /* total size of the user buffer */
+	struct mm_struct *mm;      /* user mm struct */
+	struct task_struct *tsk;
+	unsigned long addr;        /* user space address */
+	size_t size;               /* total size of the user buffer */
+	int is_receive;

 	struct page **page_array;  /* list of page structure pointers. */
 	u64 *addr_array;           /* list of physical page addresses. */

--
MST - Michael S. Tsirkin

From tziporet at mellanox.co.il Thu Mar 10 04:18:26 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 10 Mar 2005 14:18:26 +0200
Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ?
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF104@mtlex01.yok.mtl.com>

In VAPI we used spinlocks for this reason on all data-path verbs and it
gave us better performance.

Tziporet

-----Original Message-----
From: Michael S. Tsirkin [mailto:mst at mellanox.co.il]
Sent: Thursday, March 10, 2005 12:47 PM
To: openib-general at openib.org
Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ?

I am able to shave about 200ns off the rdma post latency, by using
pthread_spinlock instead of pthread_mutex for protecting the qp post op
in libmthca.

I'm aware of course that a context switch when spinlock is held may waste
a whole timeslice, but maybe for short operations such as this it's
reasonable to use spinlocks?

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Thu Mar 10 04:31:29 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 10 Mar 2005 14:31:29 +0200
Subject: [openib-general] [PATCH] uverbs rdma example (updated)
In-Reply-To: <20050309122700.GA2352@mellanox.co.il>
References: <20050309122700.GA2352@mellanox.co.il>
Message-ID: <20050310123129.GA12542@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Subject: [PATCH] uverbs rdma example
>
> Here is a small test for the rdma functionality.
> I based it on the pingpong test, the main change being polling on data
> instead of receive completions.
>
> This is useful as an example of using rdma, and is also useful
> as a post send latency benchmark, for tuning (nicer than the send test
> in that it lets us measure post send separately from poll cq).
>
> Code is originally based on the pingpong test.
> I intentionally did not rename functions from pingpong_ to rdma_
> to make it easier to share some code later if we decide it is useful.
>
> [...]

That code had a typo, and some whitespace badness.
The sscanf result also has to be checked.
Since that patch wasn't yet committed, here's an updated version to
replace it. Let me know what you think.

Signed-off-by: Michael S.
Tsirkin Index: Makefile.am =================================================================== --- Makefile.am (revision 1970) +++ Makefile.am (working copy) @@ -20,7 +20,8 @@ src_libibverbs_la_LDFLAGS = -version-inf src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map bin_PROGRAMS = examples/ibv_devices examples/ibv_asyncwatch \ - examples/ibv_pingpong examples/ibv_ud_pingpong + examples/ibv_pingpong examples/ibv_ud_pingpong \ + examples/ibv_rdma examples_ibv_devices_SOURCES = examples/device_list.c examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_pingpong_SOURCES = examples/pingpong.c @@ -29,6 +30,8 @@ examples_ibv_ud_pingpong_SOURCES = examp examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_asyncwatch_SOURCES = examples/asyncwatch.c examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la +examples_ibv_rdma_SOURCES = examples/rdma.c +examples_ibv_rdma_LDADD = $(top_builddir)/src/libibverbs.la libibverbsincludedir = $(includedir)/infiniband Index: examples/rdma.c =================================================================== --- examples/rdma.c (revision 0) +++ examples/rdma.c (revision 0) @@ -0,0 +1,698 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +enum { + PINGPONG_RDMA_WRID = 3, +}; + +static int page_size; + +struct pingpong_context { + struct ibv_context *context; + struct ibv_pd *pd; + struct ibv_mr *mr; + struct ibv_cq *cq; + struct ibv_qp *qp; + void *buf; + volatile char *post_buf; + volatile char *poll_buf; + int size; + int rx_depth; + int tx_depth; +}; + +struct pingpong_dest { + int lid; + int qpn; + int psn; + unsigned rkey; + unsigned long long vaddr; +}; + +/* + * pp_get_local_lid() uses a pretty bogus method for finding the LID + * of a local port. Please don't copy this into your app (or if you + * do, please rip it out soon). 
+ */ +static uint16_t pp_get_local_lid(struct ibv_device *dev, int port) +{ + char path[256]; + char val[16]; + char *name; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return 0; + } + + asprintf(&name, "%s/class/infiniband/%s/ports/%d/lid", path, + ibv_get_device_name(dev), port); + + if (sysfs_read_attribute_value(name, val, sizeof val)) { + fprintf(stderr, "Couldn't read LID at %s\n", name); + return 0; + } + + return strtol(val, NULL, 0); +} + +static int pp_client_connect(const char *servername, int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int n; + int sockfd = -1; + + asprintf(&service, "%d", port); + n = getaddrinfo(servername, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); + return sockfd; + } + return sockfd; +} + +struct pingpong_dest * pp_client_exch_dest(int sockfd, + const struct pingpong_dest *my_dest) +{ + struct pingpong_dest *rem_dest = NULL; + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + int parsed; + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, + my_dest->psn,my_dest->rkey,my_dest->vaddr); + if (write(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client write"); + fprintf(stderr, "Couldn't send local address\n"); + goto out; + } + + if (read(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client read"); + fprintf(stderr, "Couldn't read remote address\n"); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if 
(!rem_dest) + goto out; + + parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, + &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); + + if (parsed != 5) { + fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, + msg); + free(rem_dest); + rem_dest = NULL; + goto out; + } +out: + return rem_dest; +} + +int pp_server_connect(int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_flags = AI_PASSIVE, + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int sockfd = -1, connfd; + int n; + + asprintf(&service, "%d", port); + n = getaddrinfo(NULL, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + n = 1; + + setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + + if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't listen to port %d\n", port); + return sockfd; + } + + listen(sockfd, 1); + connfd = accept(sockfd, NULL, 0); + if (connfd < 0) { + perror("server accept"); + fprintf(stderr, "accept() failed\n"); + close(sockfd); + return connfd; + } + + close(sockfd); + return connfd; +} + +static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) +{ + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + struct pingpong_dest *rem_dest = NULL; + int parsed; + int n; + + n = read(connfd, msg, sizeof msg); + if (n != sizeof msg) { + perror("server read"); + fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if (!rem_dest) + goto out; + + parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, + &rem_dest->psn, &rem_dest->rkey, 
&rem_dest->vaddr); + if (parsed != 5) { + fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, + msg); + free(rem_dest); + rem_dest = NULL; + goto out; + } + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, + my_dest->psn, my_dest->rkey, my_dest->vaddr); + if (write(connfd, msg, sizeof msg) != sizeof msg) { + perror("server write"); + fprintf(stderr, "Couldn't send local address\n"); + free(rem_dest); + rem_dest = NULL; + goto out; + } +out: + return rem_dest; +} + +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, + int tx_depth, int rx_depth, int port) +{ + struct pingpong_context *ctx; + + ctx = malloc(sizeof *ctx); + if (!ctx) + return NULL; + + ctx->size = size; + ctx->rx_depth = rx_depth; + ctx->tx_depth = tx_depth; + + ctx->buf = memalign(page_size, size * 2); + if (!ctx->buf) { + fprintf(stderr, "Couldn't allocate work buf.\n"); + return NULL; + } + + memset(ctx->buf, 0, size * 2); + + ctx->post_buf = (char*)ctx->buf + (size - 1); + ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); + + ctx->context = ibv_open_device(ib_dev); + if (!ctx->context) { + fprintf(stderr, "Couldn't get context for %s\n", + ibv_get_device_name(ib_dev)); + return NULL; + } + + ctx->pd = ibv_alloc_pd(ctx->context); + if (!ctx->pd) { + fprintf(stderr, "Couldn't allocate PD\n"); + return NULL; + } + + ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, + IBV_ACCESS_REMOTE_WRITE); + if (!ctx->mr) { + fprintf(stderr, "Couldn't allocate MR\n"); + return NULL; + } + + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); + if (!ctx->cq) { + fprintf(stderr, "Couldn't create CQ\n"); + return NULL; + } + + { + struct ibv_qp_init_attr attr = { + .send_cq = ctx->cq, + .recv_cq = ctx->cq, + .cap = { + .max_send_wr = tx_depth, + .max_recv_wr = rx_depth, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .qp_type = IBV_QPT_RC + }; + + ctx->qp = ibv_create_qp(ctx->pd, &attr); + if (!ctx->qp) { + fprintf(stderr, "Couldn't 
create QP\n"); + return NULL; + } + } + + { + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_INIT; + attr.pkey_index = 0; + attr.port_num = port; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS)) { + fprintf(stderr, "Failed to modify QP to INIT\n"); + return NULL; + } + } + + return ctx; +} + +static int pp_post_rdma(struct pingpong_context *ctx, + struct pingpong_dest* rem_dest) +{ + struct ibv_sge list = { + .addr = (uintptr_t) ctx->buf, + .length = ctx->size, + .lkey = ctx->mr->lkey + }; + struct ibv_send_wr wr = { + .wr_id = PINGPONG_RDMA_WRID, + .sg_list = &list, + .num_sge = 1, + .opcode = IBV_WR_RDMA_WRITE, + .send_flags = IBV_SEND_SIGNALED, + .wr.rdma.remote_addr = rem_dest->vaddr, + .wr.rdma.rkey = rem_dest->rkey + }; + struct ibv_send_wr *bad_wr; + + return ibv_post_send(ctx->qp, &wr, &bad_wr); +} + +static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, + struct pingpong_dest *dest) +{ + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_RTR; + attr.path_mtu = IBV_MTU_1024; + attr.dest_qp_num = dest->qpn; + attr.rq_psn = dest->psn; + attr.max_dest_rd_atomic = 1; + attr.min_rnr_timer = 12; + attr.ah_attr.is_global = 0; + attr.ah_attr.dlid = dest->lid; + attr.ah_attr.sl = 0; + attr.ah_attr.src_path_bits = 0; + attr.ah_attr.port_num = port; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER)) { + fprintf(stderr, "Failed to modify QP to RTR\n"); + return 1; + } + + attr.qp_state = IBV_QPS_RTS; + attr.timeout = 14; + attr.retry_cnt = 7; + attr.rnr_retry = 7; + attr.sq_psn = my_psn; + attr.max_rd_atomic = 1; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC)) { + fprintf(stderr, 
"Failed to modify QP to RTS\n"); + return 1; + } + + return 0; +} + +static void usage(const char *argv0) +{ + printf("Usage:\n"); + printf(" %s start a server and wait for connection\n", argv0); + printf(" %s connect to server at \n", argv0); + printf("\n"); + printf("Options:\n"); + printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -d, --ib-dev= use IB device (default first device found)\n"); + printf(" -i, --ib-port= use port of IB device (default 1)\n"); + printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -t, --tx-depth= size of tx queue (default 50)\n"); + printf(" -n, --iters= number of exchanges (default 1000)\n"); +} + +int main(int argc, char *argv[]) +{ + struct dlist *dev_list; + struct ibv_device *ib_dev; + struct pingpong_context *ctx; + struct pingpong_dest my_dest; + struct pingpong_dest *rem_dest; + struct timeval start, end; + char *ib_devname = NULL; + char *servername = NULL; + int port = 18515; + int ib_port = 1; + int size = 1; + int rx_depth = 1; + int tx_depth = 50; + int iters = 1000; + int scnt, rcnt, ccnt; + int client_first_post; + int sockfd; + + srand48(getpid() * time(NULL)); + + while (1) { + int c; + + static struct option long_options[] = { + { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "ib-dev", .has_arg = 1, .val = 'd' }, + { .name = "ib-port", .has_arg = 1, .val = 'i' }, + { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "tx-depth",.has_arg = 1, .val = 't' }, + { 0 } + }; + + c = getopt_long(argc, argv, "p:d:i:s:t:n:e", long_options, NULL); + if (c == -1) + break; + + switch (c) { + case 'p': + port = strtol(optarg, NULL, 0); + if (port < 0 || port > 65535) { + usage(argv[0]); + return 1; + } + break; + + case 'd': + ib_devname = strdupa(optarg); + break; + + case 'i': + ib_port = strtol(optarg, NULL, 0); + if (port < 0) { + usage(argv[0]); + return 1; + } + break; + + case 's': + size = strtol(optarg, 
NULL, 0); + break; + + case 't': + tx_depth = strtol(optarg, NULL, 0); + break; + + case 'n': + iters = strtol(optarg, NULL, 0); + break; + + default: + usage(argv[0]); + return 1; + } + } + + if (optind == argc - 1) + servername = strdupa(argv[optind]); + else if (optind < argc) { + usage(argv[0]); + return 1; + } + + page_size = sysconf(_SC_PAGESIZE); + + dev_list = ibv_get_devices(); + + dlist_start(dev_list); + if (!ib_devname) { + ib_dev = dlist_next(dev_list); + if (!ib_dev) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } + } else { + dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + break; + if (!ib_dev) { + fprintf(stderr, "IB device %s not found\n", ib_devname); + return 1; + } + } + + ctx = pp_init_ctx(ib_dev, size, iters, rx_depth, ib_port); + if (!ctx) + return 1; + + my_dest.lid = pp_get_local_lid(ib_dev, ib_port); + my_dest.qpn = ctx->qp->qp_num; + my_dest.psn = lrand48() & 0xffffff; + if (!my_dest.lid) { + fprintf(stderr, "Local lid 0x0 detected. Is an SM running?\n"); + return 1; + } + my_dest.rkey = ctx->mr->rkey; + my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + + printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " + "RKey %#08x VAddr %#016Lx\n", + my_dest.lid, my_dest.qpn, my_dest.psn, + my_dest.rkey, my_dest.vaddr); + + + if (servername) { + sockfd = pp_client_connect(servername, port); + } else { + sockfd = pp_server_connect(port); + } + if (sockfd < 0) + return 1; + + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + if (!rem_dest) + return 1; + + printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " + "RKey %#08x VAddr %#016Lx\n", + rem_dest->lid, rem_dest->qpn, rem_dest->psn, + rem_dest->rkey, rem_dest->vaddr); + + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) + return 1; + + /* An additional handshake is required *after* moving qp to RTR. 
+ Arbitrarily reuse exch_dest for this purpose. */ + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + write(sockfd, "done", sizeof "done"); + close(sockfd); + + if (gettimeofday(&start, NULL)) { + perror("gettimeofday"); + return 1; + } + + scnt = 0; + rcnt = 0; + ccnt = 0; + if (servername) + client_first_post = 1; + else + client_first_post = 0; + + while (scnt < iters || ccnt < iters || rcnt < iters) { + + /* Wait till buffer changes. */ + if (rcnt < iters && ! client_first_post) { + ++rcnt; + while (*ctx->poll_buf != (char)rcnt) { + } + /* Here the data is already in the physical memory. + If we wanted to actually use it, we may need + a read memory barrier here. */ + } else + client_first_post = 0; + + if (scnt < iters) { + *ctx->post_buf = (char)++scnt; + if (pp_post_rdma(ctx, rem_dest)) { + fprintf(stderr, "Couldn't post send: scnt=%d\n", + scnt); + return 1; + } + } + + if (ccnt < iters) { + struct ibv_wc wc; + int ne; + ++ccnt; + do { + ne = ibv_poll_cq(ctx->cq, 1, &wc); + } while (ne == 0); + + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } + if (wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Completion wth error at %s:\n", + servername?"client":"server"); + fprintf(stderr, "Failed status %d: wr_id %d\n", + wc.status, (int) wc.wr_id); + fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n", + scnt, rcnt, ccnt); + return 1; + } + } + } + + if (gettimeofday(&end, NULL)) { + perror("gettimeofday"); + return 1; + } + + { + float usec = (end.tv_sec - start.tv_sec) * 1000000 + + (end.tv_usec - start.tv_usec); + long long bytes = (long long) size * iters; + + printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n", + bytes, usec / 1000000., bytes * 8. / usec); + printf("%d iters in %.2f seconds = %.2f usec/iter\n", + iters, usec / 1000000., usec / iters); + } + + return 0; +} -- MST - Michael S. 
Tsirkin From mst at mellanox.co.il Thu Mar 10 06:17:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 16:17:01 +0200 Subject: [openib-general] Re: [PATCH] AIO code to use get_user_pages In-Reply-To: <20050310121941.GI2586@mellanox.co.il> References: <20050310121941.GI2586@mellanox.co.il> Message-ID: <20050310141701.GB12542@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH] AIO code to use get_user_pages > > Well, I went ahead and modified the AIO code to use get_user_pages. > Since we dont yet have fmr support, this patch is untested, but it > does compile :) Please let me know what do you think. > > Another approach (instead of waiting for fmr support) could be > to add a fall-back option to use a regular memory region. > A todo item is to add zcopy support for synchronous operations. It seems I tried to use the same kmap slot twice. Thats not right. Here's a small patch to apply on top of that. Signed-off-by: Michael S. Tsirkin --- sdp/sdp_iocb.c 2005-03-10 11:28:02.000000000 +0200 +++ sdp-fix//sdp_iocb.c 2005-03-10 13:36:28.000000000 +0200 @@ -48,11 +48,11 @@ static void sdp_copy_one_page(struct pag void* tptr; fptr = kmap_atomic(from, KM_USER0); - tptr = kmap_atomic(to, KM_USER0); + tptr = kmap_atomic(to, KM_USER1); memcpy(tptr + offset, fptr + offset, size); - kunmap_atomic(tptr, KM_USER0); + kunmap_atomic(tptr, KM_USER1); kunmap_atomic(fptr, KM_USER0); set_page_dirty_lock(to); } -- MST - Michael S. Tsirkin From halr at voltaire.com Thu Mar 10 07:31:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Mar 2005 10:31:38 -0500 Subject: [openib-general] CM REP sent and local QPN setting Message-ID: <1110468698.4659.19.camel@localhost.localdomain> Hi Sean, Doesn't the passive CM need to set the local_qpn in cm_id_private when sending the REP ? 
Inside of ib_send_cm_rep(), shouldn't there be the following line of code:

	cm_id_priv->local_qpn = cm_rep_get_local_qpn(rep_msg);

Otherwise, if the active side disconnects, the remote QPN in the DREQ cannot be matched.

-- Hal

From mshefty at ichips.intel.com Thu Mar 10 09:20:51 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Mar 2005 09:20:51 -0800
Subject: [openib-general] Re: CM REP sent and local QPN setting
In-Reply-To: <1110468698.4659.19.camel@localhost.localdomain>
References: <1110468698.4659.19.camel@localhost.localdomain>
Message-ID: <423081F3.7040003@ichips.intel.com>

Hal Rosenstock wrote:
> Hi Sean,
>
> Doesn't the passive CM need to set the local_qpn in cm_id_private when
> sending the REP ? Inside of ib_send_cm_rep(), shouldn't there be the
> following line of code:
>
> cm_id_priv->local_qpn = cm_rep_get_local_qpn(rep_msg);

The passive side needs to store the local QPN somewhere. I'll take a look to see if this is done anywhere else (I'm assuming not, going by your message), and if not, add it in. Thanks for the info.

- Sean

From mshefty at ichips.intel.com Thu Mar 10 09:34:45 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Mar 2005 09:34:45 -0800
Subject: [openib-general] Re: CM REP sent and local QPN setting
In-Reply-To: <423081F3.7040003@ichips.intel.com>
References: <1110468698.4659.19.camel@localhost.localdomain> <423081F3.7040003@ichips.intel.com>
Message-ID: <42308535.4020100@ichips.intel.com>

>> Doesn't the passive CM need to set the local_qpn in cm_id_private when
>> sending the REP ? Inside of ib_send_cm_rep(), shouldn't there be the
>> following line of code:
>>
>> cm_id_priv->local_qpn = cm_rep_get_local_qpn(rep_msg);
>
>
> The passive sides needs to store the local QPN somewhere. I'll take a
> look to see if this is done anywhere else (I'm assuming not by your
> message), and if not add it in. Thanks for the info.

I've committed a patch to fix this using your suggestion above.
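
For readers following the thread, here is a minimal user-space sketch of why the passive side has to record the QPN it puts in the REP. It is not the CM code: `struct conn`, `send_rep()` and `match_dreq()` are hypothetical names invented for illustration, and the only fact taken from the thread is that an incoming DREQ is matched on the sender's view of the remote QPN, which is our local QPN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the CM's per-connection state. */
struct conn {
	uint32_t local_qpn;	/* QPN we advertised in our REP */
	uint32_t remote_qpn;	/* QPN the peer advertised in its REQ */
};

/* Passive side: remember the QPN placed in the REP when it is sent. */
static void send_rep(struct conn *c, uint32_t rep_local_qpn)
{
	assert(c != NULL);
	c->local_qpn = rep_local_qpn;
}

/*
 * A DREQ names the QPN as seen by its sender, i.e. our local QPN.
 * Return the matching connection, or NULL if nothing was recorded.
 */
static struct conn *match_dreq(struct conn *tbl, size_t n,
			       uint32_t dreq_remote_qpn)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].local_qpn == dreq_remote_qpn)
			return &tbl[i];
	return NULL;
}
```

If `send_rep()` never stores the value, `local_qpn` keeps whatever it was initialised to and the lookup in `match_dreq()` fails, which is the symptom Hal describes.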

- Sean

From roland at topspin.com Thu Mar 10 09:49:16 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 10 Mar 2005 09:49:16 -0800
Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ?
In-Reply-To: <20050310104722.GG2586@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 10 Mar 2005 12:47:22 +0200")
References: <20050310104722.GG2586@mellanox.co.il>
Message-ID: <52acpbbdcz.fsf@topspin.com>

    Michael> I am able to shave about 200ns off the rdma post latency,
    Michael> by using pthread_spinlock instead of pthread_mutex for
    Michael> protecting the qp post op in libmthca.

    Michael> I'm aware of course that a context switch when spinlock
    Michael> is held may waste a whole timeslice , but maybe for short
    Michael> operations such as this it's reasonable to use spinlocks?

You're right, this is a significant performance boost. I had believed that since pthread_mutex_lock and pthread_mutex_unlock can be done completely in userspace with NPTL and futexes (with only a single locked instruction when there is no contention), doing pthread_spin_lock/pthread_spin_unlock instead would be roughly equivalent. However, a quick synthetic benchmark shows that I was completely wrong: uncontended pthread_mutex_t operations are much slower than uncontended pthread_spinlock_t operations on i386, x86_64, ppc64 and ia64.

I've committed a changeset for libmthca that converts from pthread_mutex_t to pthread_spinlock_t, which results in a measurable improvement on real pingpong tests.

Thanks,
  Roland

From roland at topspin.com Thu Mar 10 10:29:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 10 Mar 2005 10:29:20 -0800
Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs
In-Reply-To: <20050310104211.GF2586@mellanox.co.il> (Michael S.
Tsirkin's message of "Thu, 10 Mar 2005 12:42:11 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> Message-ID: <521xanbbi7.fsf@topspin.com> Michael> I think I have discovered the problem. It seems that with Michael> -O3 my compiler may reorder the WQE (and possibly CQE) Michael> write with respect to the doorbell. This wont happen on Michael> i386 with consistent i/o ordering since the doorbell is Michael> done in assembly, and probably not on other 32 bit Michael> architectures since the mutex is likely to include a Michael> memory barrier. Michael> Applying the folowing patch fixes the problem for me for Michael> x86_64. Thanks for diagnosing this. I think I want to work on a more general fix though. - R. From mst at mellanox.co.il Thu Mar 10 10:42:30 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 20:42:30 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <521xanbbi7.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> Message-ID: <20050310184230.GA13051@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > Michael> I think I have discovered the problem. It seems that with > Michael> -O3 my compiler may reorder the WQE (and possibly CQE) > Michael> write with respect to the doorbell. This wont happen on > Michael> i386 with consistent i/o ordering since the doorbell is > Michael> done in assembly, and probably not on other 32 bit > Michael> architectures since the mutex is likely to include a > Michael> memory barrier. > > Michael> Applying the folowing patch fixes the problem for me for > Michael> x86_64. > > Thanks for diagnosing this. 
I think I want to work on a more general
> fix though.
>
> - R.
>

Generally I think you'll need to implement a write memory barrier, and use it before each doorbell. I didn't find an efficient portable way to do this. I suggest implementing it for ppc with eieio, and simply using a spinlock instead of a barrier for anything else.

-- 
MST - Michael S. Tsirkin

From mshefty at ichips.intel.com Thu Mar 10 15:37:08 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Mar 2005 15:37:08 -0800
Subject: [openib-general] MAD receive work completion
Message-ID: <4230DA24.3040801@ichips.intel.com>

I'm hitting an issue in the CM where I need to access work completion information about a received MAD. The CM takes the received MAD and queues it to a CM-owned work queue for processing. It then accesses the wc field from ib_mad_recv_wc shown below.

struct ib_mad_recv_wc {
	struct ib_wc *wc;
	struct ib_mad_recv_buf recv_buf;
	int mad_len;
};

The ib_mad_recv_wc and referenced data buffers are owned by the CM until it calls ib_free_recv_mad(); however, the wc field references an item that is declared on the stack.

I see two main solutions. The CM can allocate its own ib_wc structure and copy the contents of the returned work completion. Or the MAD layer can avoid allocating the work completion on the stack. Thoughts?

- Sean

From mshefty at ichips.intel.com Thu Mar 10 15:48:57 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Mar 2005 15:48:57 -0800
Subject: [openib-general] [PATCH] [CM] fix CM unload after receiving a bad REQ
Message-ID: <20050310154857.64f71b8a.mshefty@ichips.intel.com>

The following patch fixes the issue of unloading the CM after receiving a bad REQ.
Signed-off-by: Sean Hefty Index: cm.c =================================================================== --- cm.c (revision 1974) +++ cm.c (working copy) @@ -237,8 +237,8 @@ { av->port = port; av->pkey_index = wc->pkey_index; - cm_set_ah_attr(&av->ah_attr, port->port_num, wc->slid, wc->sl, - wc->dlid_path_bits); + cm_set_ah_attr(&av->ah_attr, port->port_num, cpu_to_be16(wc->slid), + wc->sl, wc->dlid_path_bits); } static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) @@ -648,7 +648,7 @@ spin_unlock_irqrestore(&cm_id_priv->lock, flags); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, &cm_id_priv->av.port->cm_dev->ca_guid, - sizeof &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof cm_id_priv->av.port->cm_dev->ca_guid, NULL, 0); break; case IB_CM_ESTABLISHED: @@ -1038,14 +1038,14 @@ if (cm_insert_remote_id(cm_id_priv->timewait_info)) { spin_unlock_irqrestore(&cm.lock, flags); ret = -EINVAL; - goto error2; + goto error1; } /* Check for a stale connection. */ if (cm_insert_remote_qpn(cm_id_priv->timewait_info)) { spin_unlock_irqrestore(&cm.lock, flags); /* todo: reject as stale */ ret = -EINVAL; - goto error2; + goto error1; } /* Find matching listen request. 
*/ listen_cm_id_priv = cm_find_listen(req_msg->service_id); @@ -1053,7 +1053,7 @@ spin_unlock_irqrestore(&cm.lock, flags); /* todo: reject with no match */ ret = -EINVAL; - goto error2; + goto error1; } atomic_inc(&listen_cm_id_priv->refcount); atomic_inc(&cm_id_priv->refcount); @@ -1069,11 +1069,11 @@ cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av); if (ret) - goto error3; + goto error2; if (req_msg->alt_local_lid) { ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av); if (ret) - goto error3; + goto error2; } cm_id_priv->timeout_ms = cm_convert_to_ms( cm_req_get_local_resp_timeout(req_msg)); @@ -1092,11 +1092,9 @@ cm_process_work(cm_id_priv, work); cm_deref_id(listen_cm_id_priv); return 0; -error3: - cm_deref_id(listen_cm_id_priv); error2: - cm_cleanup_timewait(cm_id_priv->timewait_info); - kfree(cm_id_priv->timewait_info); + atomic_dec(&cm_id_priv->refcount); + cm_deref_id(listen_cm_id_priv); error1: ib_destroy_cm_id(&cm_id_priv->id); return ret; @@ -2592,7 +2590,7 @@ } static void cm_send_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_send_wc *mad_send_wc) + struct ib_mad_send_wc *mad_send_wc) { struct cm_msg *msg; From ftillier at infiniconsys.com Thu Mar 10 15:54:15 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Thu, 10 Mar 2005 15:54:15 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230DA24.3040801@ichips.intel.com> Message-ID: <000201c525cc$78423c50$8d5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > I'm hitting an issue in the CM where I need to access work completion > information about a received MAD. The CM takes the received MAD and > queues it to a CM owned work queue for processing. It then accesses > the wc field from ib_mad_recv_wc shown below. 
> > struct ib_mad_recv_wc { > struct ib_wc *wc; > struct ib_mad_recv_buf recv_buf; > int mad_len; > }; > > The ib_mad_recv_wc and referenced data buffers are owned by the CM > until it calls ib_free_recv_mad(), however the wc field references an > item that is declared on the stack. > > I see two main solutions. The CM can allocate its own ib_wc structure > and copy the contents of the returned work completion. Or the MAD > layer can avoid allocating the work completion on the stack. Thoughts? I assume you need the WC information to help you reply, right? I would say change the ib_wc embedded in ib_mad_recv_wc from a pointer to the structure, and then use that when you poll. That way you avoid an extra allocation in the MAD layer, and avoid the data copy in the CM. Note that I may have completely missed your point, in which case ignore me. - Fab From mshefty at ichips.intel.com Thu Mar 10 16:03:07 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 16:03:07 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <000201c525cc$78423c50$8d5aa8c0@infiniconsys.com> References: <000201c525cc$78423c50$8d5aa8c0@infiniconsys.com> Message-ID: <4230E03B.9070008@ichips.intel.com> Fab Tillier wrote: > I assume you need the WC information to help you reply, right? > > I would say change the ib_wc embedded in ib_mad_recv_wc from a pointer to > the structure, and then use that when you poll. That way you avoid an extra > allocation in the MAD layer, and avoid the data copy in the CM. > > Note that I may have completely missed your point, in which case ignore me. You got the point. I need to generate a reply. Moving the wc from a pointer to a structure means that the MAD layer needs to know a.) that the next completion is a receive, and b.) which data buffer received the data. Currently, the MAD layer uses a single CQ for sends and receives on QP 0 and 1. (This shouldn't be overly difficult to change, however.) 
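
The constraint described above can be sketched in user space. This is only a model under stated assumptions: `struct wc`, `struct recv_buf` and `dispatch()` are hypothetical, cut-down stand-ins rather than the kernel's `ib_wc` or MAD-layer types, and `wr_id` is assumed to index the receive buffer. With one CQ shared by sends and receives, the completion has to be polled into a temporary first; only after looking at it is the target buffer known, so an embedded work completion implies a copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-ins; not the kernel's ib_wc or MAD types. */
enum wc_kind { WC_SEND, WC_RECV };

struct wc {
	uint64_t wr_id;		/* assumed here: index of the receive buffer */
	enum wc_kind kind;
	uint16_t slid;
};

struct recv_buf {
	struct wc wc;		/* embedded copy, lives as long as the buffer */
	int filled;
};

/*
 * The completion was polled into a temporary because, with a shared CQ,
 * we cannot know beforehand whether it is a receive or which buffer it
 * belongs to.  Once we know, copy it into that buffer.
 */
static struct recv_buf *dispatch(const struct wc *polled,
				 struct recv_buf *bufs, size_t nbufs)
{
	assert(polled != NULL);
	if (polled->kind != WC_RECV || polled->wr_id >= nbufs)
		return NULL;

	struct recv_buf *buf = &bufs[polled->wr_id];
	buf->wc = *polled;	/* the per-receive copy discussed in the thread */
	buf->filled = 1;
	return buf;
}
```

Avoiding that copy would need the poll to write directly into the right buffer, which is where separate send and receive CQs would come in.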
- Sean From roland at topspin.com Thu Mar 10 16:04:53 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 16:04:53 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230DA24.3040801@ichips.intel.com> (Sean Hefty's message of "Thu, 10 Mar 2005 15:37:08 -0800") References: <4230DA24.3040801@ichips.intel.com> Message-ID: <5264zz9hei.fsf@topspin.com> Sean> I'm hitting an issue in the CM where I need to access work Sean> completion information about a received MAD. The CM takes Sean> the received MAD and queues it to a CM owned work queue for Sean> processing. It then accesses the wc field from Sean> ib_mad_recv_wc shown below. This doesn't seem worth an API change to me. I think the simplest and best solution is just to copy the work completion information you need into the work structure you put onto your workqueue. - R. From mshefty at ichips.intel.com Thu Mar 10 16:35:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 16:35:39 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <5264zz9hei.fsf@topspin.com> References: <4230DA24.3040801@ichips.intel.com> <5264zz9hei.fsf@topspin.com> Message-ID: <4230E7DB.7020905@ichips.intel.com> Roland Dreier wrote: > Sean> I'm hitting an issue in the CM where I need to access work > Sean> completion information about a received MAD. The CM takes > Sean> the received MAD and queues it to a CM owned work queue for > Sean> processing. It then accesses the wc field from > Sean> ib_mad_recv_wc shown below. > > This doesn't seem worth an API change to me. I think the simplest and > best solution is just to copy the work completion information you need > into the work structure you put onto your workqueue. It could be done without an API change, likely changing 3-4 lines of code, with the result that the work completion would be copied for all received MADs. 
(The copy could be avoided with a more extensive change, but I would go with a simpler solution for now.) To me, it seems that the behavior isn't what a user would expect given the current API. The ib_mad_recv_wc belongs to the user until it is freed, but one of the fields in it exists only during the callback. Is this the behavior that we want? - Sean From ftillier at infiniconsys.com Thu Mar 10 16:49:41 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Thu, 10 Mar 2005 16:49:41 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230E7DB.7020905@ichips.intel.com> Message-ID: <000301c525d4$36d17bc0$8d5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > To me, it seems that the behavior isn't what a user would expect given > the current API. The ib_mad_recv_wc belongs to the user until it is > freed, but one of the fields in it exists only during the callback. Is > this the behavior that we want? It depends on the usage model. If the majority of clients always process the completion in a different thread context (i.e. using a workqueue like the CM does), then I would say make the copy always. Otherwise, I agree with Roland and it should be up to the client. However, if the CM needs to have the WC information beyond the work it does in its work queue *and* saves the received mad structure already, then it will copy it once from the receive callback into the work structure, and then again into the cid structure. Note that this only applies to the case where the received MAD is saved beyond both the MAD completion callback and the CM workqueue handler. 
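
The client-side alternative (copying the work completion into the item that goes onto the client's own work queue) might look roughly like the following user-space sketch. All names are hypothetical: `struct wc`, `struct recv_wc`, `struct cm_work` and `recv_handler()` only mimic the shape of the structures discussed in this thread, and `malloc()` stands in for whatever allocator the real code would use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real kernel structures. */
struct wc {
	uint16_t slid;
	uint16_t pkey_index;
	uint8_t sl;
};

struct recv_wc {
	struct wc *wc;		/* only valid during the receive callback */
	int mad_len;
};

/* Work item queued by the client; it owns a private copy of the wc. */
struct cm_work {
	struct wc wc;			/* copied by value in the callback */
	struct recv_wc *mad_recv;	/* still owned by the client until freed */
};

/* Receive callback: runs while recv->wc still points at valid data. */
static struct cm_work *recv_handler(struct recv_wc *recv)
{
	assert(recv != NULL && recv->wc != NULL);

	struct cm_work *work = malloc(sizeof *work);
	if (!work)
		return NULL;

	work->wc = *recv->wc;	/* snapshot before the callback returns */
	work->mad_recv = recv;
	return work;
}
```

The deferred handler would then read `work->wc` and never dereference `mad_recv->wc`. If the received MAD is kept beyond the work handler as well, a second copy into longer-lived state is needed, which is the double copy mentioned above.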
- Fab From iod00d at hp.com Thu Mar 10 17:01:04 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 10 Mar 2005 17:01:04 -0800 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <521xanbbi7.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> Message-ID: <20050311010104.GC16523@esmail.cup.hp.com> On Thu, Mar 10, 2005 at 10:29:20AM -0800, Roland Dreier wrote: > Michael> Applying the folowing patch fixes the problem for me for > Michael> x86_64. > > Thanks for diagnosing this. I think I want to work on a more general > fix though. Crap. /usr/include/linux/compiler.h shows "barrier" but on debian, it's #ifdef __KERNEL__ only. I'm wondering if xf86 has the same problem for graphics drivers talking to cards. I'm not certain barrier would be the "perfect" solution, but expect it should DTRT. grant From roland at topspin.com Thu Mar 10 17:00:21 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 17:00:21 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230E7DB.7020905@ichips.intel.com> (Sean Hefty's message of "Thu, 10 Mar 2005 16:35:39 -0800") References: <4230DA24.3040801@ichips.intel.com> <5264zz9hei.fsf@topspin.com> <4230E7DB.7020905@ichips.intel.com> Message-ID: <521xan9eu2.fsf@topspin.com> Sean> It could be done without an API change, likely changing 3-4 Sean> lines of code, with the result that the work completion Sean> would be copied for all received MADs. (The copy could be Sean> avoided with a more extensive change, but I would go with a Sean> simpler solution for now.) Sorry, you're right. I hadn't gone back and looked at the actual API, and I didn't remember that ib_free_recv_mad() takes the whole struct ib_mad_recv_wc (I thought it just took the ib_mad_recv_buf). 
Sean> To me, it seems that the behavior isn't what a user would Sean> expect given the current API. The ib_mad_recv_wc belongs to Sean> the user until it is freed, but one of the fields in it Sean> exists only during the callback. Is this the behavior that Sean> we want? I agree, this is rather ugly. Now I'm not sure whether it makes sense to change the wc member of struct ib_mad_recv_wc from a struct ib_wc * to just a struct ib_wc. On the one hand it makes the API slightly cleaner, but on the other hand it is an incompatible change that may limit the internal implementation of handling MAD receives. - Roland From roland at topspin.com Thu Mar 10 20:50:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 20:50:27 -0800 Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic() Message-ID: <52oedq946k.fsf@topspin.com> drivers/infiniband/core/mad.c is in Andrew's list... >From looking at the code, the best fix I can come up with is just to always use GFP_ATOMIC ... worst case we drop a MAD under memory pressure. The other option is to change ib_post_send_mad() to take a GFP_ mask as a parameter, but that doesn't seem worth doing... --- infiniband/core/mad.c (revision 1975) +++ infiniband/core/mad.c (working copy) @@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct if (!ret || !device->process_mad) goto out; - if (in_atomic() || irqs_disabled()) - alloc_flags = GFP_ATOMIC; - else - alloc_flags = GFP_KERNEL; - local = kmalloc(sizeof *local, alloc_flags); + local = kmalloc(sizeof *local, GFP_ATOMIC); if (!local) { ret = -ENOMEM; printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); @@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent } /* Allocate MAD send WR tracking structure */ - mad_send_wr = kmalloc(sizeof *mad_send_wr, - (in_atomic() || irqs_disabled()) ? 
- GFP_ATOMIC : GFP_KERNEL); + mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC); if (!mad_send_wr) { printk(KERN_ERR PFX "No memory for " "ib_mad_send_wr_private\n"); -------------- next part -------------- An embedded message was scrubbed... From: Andrew Morton Subject: inappropriate use of in_atomic() Date: Thu, 10 Mar 2005 20:40:06 -0800 Size: 3123 URL: From mst at mellanox.co.il Thu Mar 10 23:17:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 11 Mar 2005 09:17:02 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050311010104.GC16523@esmail.cup.hp.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050311010104.GC16523@esmail.cup.hp.com> Message-ID: <20050311071702.GA20891@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs > > On Thu, Mar 10, 2005 at 10:29:20AM -0800, Roland Dreier wrote: > > Michael> Applying the folowing patch fixes the problem for me for > > Michael> x86_64. > > > > Thanks for diagnosing this. I think I want to work on a more general > > fix though. > > Crap. /usr/include/linux/compiler.h shows "barrier" but on debian, > it's #ifdef __KERNEL__ only. I think its because its a kernel header. current distributions seem to put a copy of kernel headers under /usr/include/linux and /usr/include/asm. I suspect using these isnt allowed. > I'm wondering if xf86 has the same problem > for graphics drivers talking to cards. > > I'm not certain barrier would be the "perfect" solution, but expect > it should DTRT. > > grant > Actually I think we need a wmb here. It just happens to be equivalent to barrier on x86_64. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 10 23:31:08 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 11 Mar 2005 09:31:08 +0200 Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <52oedq946k.fsf@topspin.com> References: <52oedq946k.fsf@topspin.com> Message-ID: <20050311073108.GA20989@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [Andrew Morton] inappropriate use of in_atomic() > > drivers/infiniband/core/mad.c is in Andrew's list... > > >From looking at the code, the best fix I can come up with is just to > always use GFP_ATOMIC ... worst case we drop a MAD under memory > pressure. The other option is to change ib_post_send_mad() to take a > GFP_ mask as a parameter, but that doesn't seem worth doing... > > --- infiniband/core/mad.c (revision 1975) > +++ infiniband/core/mad.c (working copy) > @@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct > if (!ret || !device->process_mad) > goto out; > > - if (in_atomic() || irqs_disabled()) > - alloc_flags = GFP_ATOMIC; > - else > - alloc_flags = GFP_KERNEL; > - local = kmalloc(sizeof *local, alloc_flags); > + local = kmalloc(sizeof *local, GFP_ATOMIC); > if (!local) { > ret = -ENOMEM; > printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); > @@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent > } > > /* Allocate MAD send WR tracking structure */ > - mad_send_wr = kmalloc(sizeof *mad_send_wr, > - (in_atomic() || irqs_disabled()) ? > - GFP_ATOMIC : GFP_KERNEL); > + mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC); > if (!mad_send_wr) { > printk(KERN_ERR PFX "No memory for " > "ib_mad_send_wr_private\n"); > > > > > > Date: Thu, 10 Mar 2005 20:40:06 -0800 > From: Andrew Morton > Subject: inappropriate use of in_atomic() > > > in_atomic() is not a reliable indication of whether it is currently safe > to call schedule(). > > This is because the lockdepth beancounting which in_atomic() uses is only > accumulated if CONFIG_PREEMPT=y. in_atomic() will return false inside > spinlocks if CONFIG_PREEMPT=n. 
> > Consequently the use of in_atomic() in the below files is probably > deadlocky if CONFIG_PREEMPT=n: > > arch/ppc64/kernel/viopath.c > drivers/net/irda/sir_kthread.c > drivers/net/wireless/airo.c > drivers/video/amba-clcd.c > drivers/acpi/osl.c > drivers/ieee1394/ieee1394_transactions.c > drivers/infiniband/core/mad.c > > Note that the same beancounting is used for the "scheduling while atomic" > warning, so if the code calls schedule with locks held, we won't get a > warning. Both are tied to CONFIG_PREEMPT=y. > > The kernel provides no reliable runtime way of detecting whether or not it > is safe to call schedule(). > > Can we please find ways to change the above code to not use in_atomic()? > Then we can whack #ifndef MODULE around its definition to reduce > reoccurrences. Will probably rename it to something more scary as well. > > Thanks. > Sdp also has a couple of uses. Maybe we can use the atomic branch in all cases here, as well? Libor? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Fri Mar 11 05:14:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 11 Mar 2005 15:14:46 +0200 Subject: [openib-general] fmr support in mthca In-Reply-To: <526507mkmm.fsf@topspin.com> References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> Message-ID: <20050311131446.GC20989@mellanox.co.il> Roland, would you like me to implement FMRs in mthca? It is needed by SDP for zero copy support. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Fri Mar 11 05:23:12 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Fri, 11 Mar 2005 15:23:12 +0200
Subject: [openib-general] Re: Re: ANNOUNCE: First usable version of userspace verbs
In-Reply-To: <20050311071702.GA20891@mellanox.co.il>
References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050311010104.GC16523@esmail.cup.hp.com> <20050311071702.GA20891@mellanox.co.il>
Message-ID: <20050311132311.GD20989@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Subject: Re: Re: ANNOUNCE: First usable version of userspace verbs
>
> Quoting r. Grant Grundler :
> > Subject: Re: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs
> >
> > On Thu, Mar 10, 2005 at 10:29:20AM -0800, Roland Dreier wrote:
> > > Michael> Applying the folowing patch fixes the problem for me for
> > > Michael> x86_64.
> > >
> > > Thanks for diagnosing this. I think I want to work on a more general
> > > fix though.
> >
> > Crap. /usr/include/linux/compiler.h shows "barrier" but on debian,
> > it's #ifdef __KERNEL__ only.
> I think its because its a kernel header. current distributions seem to
> put a copy of kernel headers under /usr/include/linux and /usr/include/asm.
> I suspect using these isnt allowed.

For example, wmb is in /usr/include/asm/system.h, and including it on SuSE 9.1 gives you:

~>cat foo.c
#include <asm/system.h>
~>gcc foo.c
In file included from /usr/include/asm/system.h:4,
from foo.c:1:
/usr/include/asm-x86_64/system.h: In function `__cmpxchg':
/usr/include/asm-x86_64/system.h:249: error: `LOCK_PREFIX' undeclared (first use in this function)
/usr/include/asm-x86_64/system.h:249: error: (Each undeclared identifier is reported only once
/usr/include/asm-x86_64/system.h:249: error: for each function it appears in.)
/usr/include/asm-x86_64/system.h:249: error: parse error before string constant
/usr/include/asm-x86_64/system.h:255: error: parse error before string constant
/usr/include/asm-x86_64/system.h:261: error: parse error before string constant
/usr/include/asm-x86_64/system.h:267: error: parse error before string constant
/usr/include/asm-x86_64/system.h: At top level:
/usr/include/asm-x86_64/system.h:279: error: parse error before "cmpxchg4_locked"
/usr/include/asm-x86_64/system.h:279: error: parse error before '*' token
/usr/include/asm-x86_64/system.h: In function `cmpxchg4_locked':
/usr/include/asm-x86_64/system.h:282: error: `new' undeclared (first use in this function)
/usr/include/asm-x86_64/system.h:282: error: `old' undeclared (first use in this function)
/usr/include/asm-x86_64/system.h:282: error: `__u32' undeclared (first use in this function)
/usr/include/asm-x86_64/system.h:282: error: parse error before ')' token

-- 
MST - Michael S. Tsirkin

From halr at voltaire.com Fri Mar 11 05:17:15 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Mar 2005 08:17:15 -0500
Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <52oedq946k.fsf@topspin.com>
References: <52oedq946k.fsf@topspin.com>
Message-ID: <1110547035.4659.130.camel@localhost.localdomain>

On Thu, 2005-03-10 at 23:50, Roland Dreier wrote:
> drivers/infiniband/core/mad.c is in Andrew's list...
>
> From looking at the code, the best fix I can come up with is just to
> always use GFP_ATOMIC ... worst case we drop a MAD under memory
> pressure.

That could be bad if this persists, but I suppose there are other ill effects of this.

> The other option is to change ib_post_send_mad() to take a
> GFP_ mask as a parameter, but that doesn't seem worth doing...

There aren't that many places this is called. Also, it appears to me that sa_query.c is already doing this for some of its memory allocation, and this could be passed down to ib_post_send_mad.
int ib_sa_path_rec_get(struct ib_device *device, u8 port_num,
                       struct ib_sa_path_rec *rec,
                       ib_sa_comp_mask comp_mask,
                       int timeout_ms, int gfp_mask,
                       ...

This approach seems better to me from a robustness standpoint.

Is the difficulty determining what to set the mask to for each call? If
they all end up being GFP_ATOMIC, this reduces to your preferred
solution.

The biggest impact appears to be on CM (at least currently).

-- Hal

> --- infiniband/core/mad.c (revision 1975)
> +++ infiniband/core/mad.c (working copy)
> @@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct
>  	if (!ret || !device->process_mad)
>  		goto out;
>
> -	if (in_atomic() || irqs_disabled())
> -		alloc_flags = GFP_ATOMIC;
> -	else
> -		alloc_flags = GFP_KERNEL;
> -	local = kmalloc(sizeof *local, alloc_flags);
> +	local = kmalloc(sizeof *local, GFP_ATOMIC);
>  	if (!local) {
>  		ret = -ENOMEM;
>  		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
> @@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent
>  		}
>
>  		/* Allocate MAD send WR tracking structure */
> -		mad_send_wr = kmalloc(sizeof *mad_send_wr,
> -				      (in_atomic() || irqs_disabled()) ?
> -				      GFP_ATOMIC : GFP_KERNEL);
> +		mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
>  		if (!mad_send_wr) {
>  			printk(KERN_ERR PFX "No memory for "
>  			       "ib_mad_send_wr_private\n");
>
>
>
>
> ______________________________________________________________________
>
> From: Andrew Morton
> To: Paul Mackerras , Jean Tourrilhes , javier at tudela.mad.ttd.net, linux-fbdev-devel at lists.sourceforge.net, acpi-devel at lists.sourceforge.net, linux1394-devel at lists.sourceforge.net, Roland Dreier
> Cc: linux-kernel at vger.kernel.org
> Subject: inappropriate use of in_atomic()
> Date: 10 Mar 2005 20:40:06 -0800
>
>
> in_atomic() is not a reliable indication of whether it is currently safe
> to call schedule().
>
> This is because the lockdepth beancounting which in_atomic() uses is only
> accumulated if CONFIG_PREEMPT=y.
in_atomic() will return false inside
> spinlocks if CONFIG_PREEMPT=n.
>
> Consequently the use of in_atomic() in the below files is probably
> deadlocky if CONFIG_PREEMPT=n:
>
>     arch/ppc64/kernel/viopath.c
>     drivers/net/irda/sir_kthread.c
>     drivers/net/wireless/airo.c
>     drivers/video/amba-clcd.c
>     drivers/acpi/osl.c
>     drivers/ieee1394/ieee1394_transactions.c
>     drivers/infiniband/core/mad.c
>
> Note that the same beancounting is used for the "scheduling while atomic"
> warning, so if the code calls schedule with locks held, we won't get a
> warning. Both are tied to CONFIG_PREEMPT=y.
>
> The kernel provides no reliable runtime way of detecting whether or not it
> is safe to call schedule().
>
> Can we please find ways to change the above code to not use in_atomic()?
> Then we can whack #ifndef MODULE around its definition to reduce
> reoccurrences. Will probably rename it to something more scary as well.
>
> Thanks.
>
>
> ______________________________________________________________________
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mst at mellanox.co.il Fri Mar 11 05:28:50 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 Mar 2005 15:28:50 +0200
Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <1110547035.4659.130.camel@localhost.localdomain>
References: <52oedq946k.fsf@topspin.com> <1110547035.4659.130.camel@localhost.localdomain>
Message-ID: <20050311132850.GE20989@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: [Andrew Morton] inappropriate use of in_atomic()
>
> On Thu, 2005-03-10 at 23:50, Roland Dreier wrote:
> > drivers/infiniband/core/mad.c is in Andrew's list...
> >
> > From looking at the code, the best fix I can come up with is just to
> > always use GFP_ATOMIC ...
worst case we drop a MAD under memory
> > pressure.
>
> That could be bad if this persists but I suppose there are other ill
> effects of this.
>
> > The other option is to change ib_post_send_mad() to take a
> > GFP_ mask as a parameter, but that doesn't seem worth doing...
>
> There aren't that many places this is called. Also, it appears to me
> that sa_query.c is already doing this for some of its memory allocation
> and this could be passed down to ib_post_send_mad.
>
> int ib_sa_path_rec_get(struct ib_device *device, u8 port_num,
>                        struct ib_sa_path_rec *rec,
>                        ib_sa_comp_mask comp_mask,
>                        int timeout_ms, int gfp_mask,
>                        ...
>
> This approach seems better to me from a robustness standpoint.
>
> Is the difficulty determining what to set the mask to for each call? If
> they all end up being GFP_ATOMIC, this reduces to your preferred
> solution.
>
> The biggest impact appears to be on CM (at least currently).

As far as I remember most CM code is thread level anyway, isn't it?

--
MST - Michael S. Tsirkin

From halr at voltaire.com Fri Mar 11 05:27:53 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Mar 2005 08:27:53 -0500
Subject: [openib-general] [PATCH] [CM] fix CM unload after receiving a bad REQ
In-Reply-To: <20050310154857.64f71b8a.mshefty@ichips.intel.com>
References: <20050310154857.64f71b8a.mshefty@ichips.intel.com>
Message-ID: <1110547672.4659.138.camel@localhost.localdomain>

On Thu, 2005-03-10 at 18:48, Sean Hefty wrote:
> The following patch fixes the issue of unloading the CM after
> receiving a bad REQ.

Works for me :-) Thanks.
-- Hal

From roland at topspin.com Fri Mar 11 08:56:37 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 08:56:37 -0800
Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <1110547035.4659.130.camel@localhost.localdomain> (Hal Rosenstock's message of "11 Mar 2005 08:17:15 -0500")
References: <52oedq946k.fsf@topspin.com> <1110547035.4659.130.camel@localhost.localdomain>
Message-ID: <527jke86ka.fsf@topspin.com>

    Hal> There aren't that many places this is called. Also, it
    Hal> appears to me that sa_query.c is already doing this for some
    Hal> of its memory allocation and this could be passed down to
    Hal> ib_post_send_mad.

Yes, that's an alternate solution. The question is whether it's worth
changing the API so that some callers can use GFP_KERNEL.

 - R.

From roland at topspin.com Fri Mar 11 08:57:43 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 08:57:43 -0800
Subject: [openib-general] fmr support in mthca
In-Reply-To: <20050311131446.GC20989@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 11 Mar 2005 15:14:46 +0200")
References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> <20050311131446.GC20989@mellanox.co.il>
Message-ID: <523bv286ig.fsf@topspin.com>

    Michael> Roland, would you like me to implement FMRs in mthca? It
    Michael> is needed by SDP for zero copy support.

Yes, that would be great. BTW, for mem-free mode I put the MPT and MTT
in lowmem to make FMRs simpler to use.

 - R.
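
For reference, the alternative raised in the in_atomic() thread above
(the caller passes the allocation mask down, instead of the MAD layer
guessing its context at run time) might look roughly like the userspace
sketch below. This is only an illustration under stated assumptions:
every name prefixed sketch_ is hypothetical and is not part of the
openib API, malloc() stands in for kmalloc(), and the enum stands in for
GFP_ATOMIC / GFP_KERNEL.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's GFP_ATOMIC / GFP_KERNEL masks. */
enum sketch_gfp { SKETCH_GFP_ATOMIC, SKETCH_GFP_KERNEL };

/* Counts allocations that were permitted to sleep (illustration only). */
static int sketch_sleeping_allocs;

/* Stand-in for kmalloc(): only records whether sleeping was permitted. */
static void *sketch_kmalloc(size_t size, enum sketch_gfp gfp_mask)
{
	if (gfp_mask == SKETCH_GFP_KERNEL)
		sketch_sleeping_allocs++;
	return malloc(size);
}

/* Hypothetical send tracking structure. */
struct sketch_send_wr {
	int id;
};

/*
 * Hypothetical post-send variant: the caller, which knows whether it
 * holds locks or runs in interrupt context, states that through
 * gfp_mask.  The callee never tries to detect the context itself.
 */
static int sketch_post_send(int id, enum sketch_gfp gfp_mask,
			    struct sketch_send_wr **out)
{
	struct sketch_send_wr *wr = sketch_kmalloc(sizeof *wr, gfp_mask);

	if (!wr)
		return -1;	/* kernel code would return -ENOMEM */
	wr->id = id;
	*out = wr;
	return 0;
}
```

If every caller ends up passing the atomic mask, this reduces to the
unconditional GFP_ATOMIC approach, which is the trade-off weighed in the
thread.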

From mshefty at ichips.intel.com Fri Mar 11 09:36:02 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 11 Mar 2005 09:36:02 -0800
Subject: [openib-general] MAD receive work completion
In-Reply-To: <521xan9eu2.fsf@topspin.com>
References: <4230DA24.3040801@ichips.intel.com> <5264zz9hei.fsf@topspin.com> <4230E7DB.7020905@ichips.intel.com> <521xan9eu2.fsf@topspin.com>
Message-ID: <4231D702.2070709@ichips.intel.com>

Roland Dreier wrote:
> Now I'm not sure whether it makes sense to change the wc member of
> struct ib_mad_recv_wc from a struct ib_wc * to just a struct ib_wc.
> On the one hand it makes the API slightly cleaner, but on the other
> hand it is an incompatible change that may limit the internal
> implementation of handling MAD receives.

Right now, I'm leaning towards no API change, but changing the
implementation in the MAD layer to ensure that the ib_wc is valid after
the callback returns. This would avoid limiting the MAD layer
implementation, while also avoiding changes to the existing users.

I'll generate a patch for this to clarify the idea.

- Sean

From mshefty at ichips.intel.com Fri Mar 11 09:59:40 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 11 Mar 2005 09:59:40 -0800
Subject: [openib-general] [PATCH] [MAD] fix handing user WC structure on the stack
Message-ID: <20050311095940.1edb0475.mshefty@ichips.intel.com>

This patch changes the ib_wc *wc field in ib_mad_recv_wc from pointing
to a structure on the stack to pointing to one allocated with the
received MAD buffer. This allows clients to access the field after
their receive completion handler has returned.

Signed-off-by: Sean Hefty

Index: mad.c
===================================================================
--- mad.c (revision 1964)
+++ mad.c (working copy)
@@ -1606,7 +1606,8 @@ static void ib_mad_recv_done_handler(str
 			 DMA_FROM_DEVICE);
 
 	/* Setup MAD receive work completion from "normal" work completion */
-	recv->header.recv_wc.wc = wc;
+	recv->header.wc = *wc;
+	recv->header.recv_wc.wc = &recv->header.wc;
 	recv->header.recv_wc.mad_len = sizeof(struct ib_mad);
 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
Index: mad_priv.h
===================================================================
--- mad_priv.h (revision 1964)
+++ mad_priv.h (working copy)
@@ -69,6 +69,7 @@ struct ib_mad_list_head {
 struct ib_mad_private_header {
 	struct ib_mad_list_head mad_list;
 	struct ib_mad_recv_wc recv_wc;
+	struct ib_wc wc;
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 } __attribute__ ((packed));

From roland at topspin.com Fri Mar 11 13:07:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 13:07:45 -0800
Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <52oedq946k.fsf@topspin.com> (Roland Dreier's message of "Thu, 10 Mar 2005 20:50:27 -0800")
References: <52oedq946k.fsf@topspin.com>
Message-ID: <52mzt97uxq.fsf@topspin.com>

Does anyone have a patch that they would prefer to this solution
(unconditionally use GFP_ATOMIC)? If not I'll send this upstream so
that we at least don't have the potential for deadlock with
CONFIG_PREEMPT=n. We can always update the API to pass in a gfp_mask
later, if this causes problems.

 - R.

Here's the diff I have:

Index: infiniband/core/mad.c
===================================================================
--- infiniband/core/mad.c (revision 1975)
+++ infiniband/core/mad.c (working copy)
@@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct
 	if (!ret || !device->process_mad)
 		goto out;
 
-	if (in_atomic() || irqs_disabled())
-		alloc_flags = GFP_ATOMIC;
-	else
-		alloc_flags = GFP_KERNEL;
-	local = kmalloc(sizeof *local, alloc_flags);
+	local = kmalloc(sizeof *local, GFP_ATOMIC);
 	if (!local) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
@@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent
 		}
 
 		/* Allocate MAD send WR tracking structure */
-		mad_send_wr = kmalloc(sizeof *mad_send_wr,
-				      (in_atomic() || irqs_disabled()) ?
-				      GFP_ATOMIC : GFP_KERNEL);
+		mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
 		if (!mad_send_wr) {
 			printk(KERN_ERR PFX "No memory for "
 			       "ib_mad_send_wr_private\n");

From roland at topspin.com Fri Mar 11 13:08:16 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 13:08:16 -0800
Subject: [openib-general] [PATCH] [MAD] fix handing user WC structure on the stack
In-Reply-To: <20050311095940.1edb0475.mshefty@ichips.intel.com> (Sean Hefty's message of "Fri, 11 Mar 2005 09:59:40 -0800")
References: <20050311095940.1edb0475.mshefty@ichips.intel.com>
Message-ID: <52is3x7uwv.fsf@topspin.com>

This looks good to me.

 - R.

From roland at topspin.com Fri Mar 11 13:35:34 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 13:35:34 -0800
Subject: PATCH: use GFP_ATOMIC instead of in_atomic() (was Re: [openib-general] [Andrew Morton] inappropriate use of in_atomic())
In-Reply-To: <52mzt97uxq.fsf@topspin.com> (Roland Dreier's message of "Fri, 11 Mar 2005 13:07:45 -0800")
References: <52oedq946k.fsf@topspin.com> <52mzt97uxq.fsf@topspin.com>
Message-ID: <52d5u57tnd.fsf_-_@topspin.com>

Err, here's a fixed diff that doesn't use an uninitialized alloc_flags.

Any comments?

 - R.

Index: infiniband/core/mad.c
===================================================================
--- infiniband/core/mad.c (revision 1977)
+++ infiniband/core/mad.c (working copy)
@@ -646,7 +646,7 @@ static int handle_outgoing_dr_smp(struct
 				  struct ib_smp *smp,
 				  struct ib_send_wr *send_wr)
 {
-	int ret, alloc_flags, solicited;
+	int ret, solicited;
 	unsigned long flags;
 	struct ib_mad_local_private *local;
 	struct ib_mad_private *mad_priv;
@@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct
 	if (!ret || !device->process_mad)
 		goto out;
 
-	if (in_atomic() || irqs_disabled())
-		alloc_flags = GFP_ATOMIC;
-	else
-		alloc_flags = GFP_KERNEL;
-	local = kmalloc(sizeof *local, alloc_flags);
+	local = kmalloc(sizeof *local, GFP_ATOMIC);
 	if (!local) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
@@ -678,7 +674,7 @@ static int handle_outgoing_dr_smp(struct
 	}
 	local->mad_priv = NULL;
 	local->recv_mad_agent = NULL;
-	mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags);
+	mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_ATOMIC);
 	if (!mad_priv) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for local response MAD\n");
@@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent
 		}
 
 		/* Allocate MAD send WR tracking structure */
-		mad_send_wr = kmalloc(sizeof *mad_send_wr,
-				      (in_atomic() || irqs_disabled()) ?
-				      GFP_ATOMIC : GFP_KERNEL);
+		mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
 		if (!mad_send_wr) {
 			printk(KERN_ERR PFX "No memory for "
 			       "ib_mad_send_wr_private\n");

From sean.hefty at intel.com Fri Mar 11 14:27:25 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 11 Mar 2005 14:27:25 -0800
Subject: PATCH: use GFP_ATOMIC instead of in_atomic() (was Re:[openib-general] [Andrew Morton] inappropriate use of in_atomic())
In-Reply-To: <52d5u57tnd.fsf_-_@topspin.com>
Message-ID:

>Err, here's a fixed diff that doesn't use an uninitialized alloc_flags.
>
>Any comments?

Looks fine to me.
I agree that we can change the API later if it becomes an issue. - Sean From libor at topspin.com Fri Mar 11 15:09:46 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:09:46 -0800 Subject: [openib-general] [RFC] Userspace CM access. Message-ID: <20050311150946.A31689@topspin.com> Below is the source for the kernel portion of the userspace CM. I've got enough of the userspace library to verify the basic functionality, but it's not yet ready for general use. However, I wanted to get the kernel portion posted for comment and checked-in now that the bulk of it is complete. The code is for the most part a pass through from userspace to the kernel CM, plus synchronization, sanity checking, and the event model is turned into a "get next event" interface. Next step is the userspace library. -Libor Signed-off-by: Libor Michalek Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 1979) +++ infiniband/core/Makefile (working copy) @@ -1,6 +1,7 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o \ + ib_ucm.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -12,3 +13,5 @@ ib_sa-y := sa_query.o ib_umad-y := user_mad.o + +ib_ucm-y := ucm.o Index: infiniband/include/ib_user_cm.h =================================================================== --- infiniband/include/ib_user_cm.h (revision 0) +++ infiniband/include/ib_user_cm.h (revision 0) @@ -0,0 +1,326 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_verbs.h 1852 2005-02-21 22:21:01Z roland $ + */ + +#ifndef IB_USER_CM_H +#define IB_USER_CM_H + +#include + +#define IB_USER_CM_ABI_VERSION 1 + +enum { + IB_USER_CM_CMD_CREATE_ID, + IB_USER_CM_CMD_DESTORY_ID, + IB_USER_CM_CMD_ATTR_ID, + + IB_USER_CM_CMD_LISTEN, + IB_USER_CM_CMD_ESTABLISH, + + IB_USER_CM_CMD_SEND_REQ, + IB_USER_CM_CMD_SEND_REP, + IB_USER_CM_CMD_SEND_RTU, + IB_USER_CM_CMD_SEND_DREQ, + IB_USER_CM_CMD_SEND_DREP, + IB_USER_CM_CMD_SEND_REJ, + IB_USER_CM_CMD_SEND_MRA, + IB_USER_CM_CMD_SEND_LAP, + IB_USER_CM_CMD_SEND_APR, + IB_USER_CM_CMD_SEND_SIDR_REQ, + IB_USER_CM_CMD_SEND_SIDR_REP, + IB_USER_CM_CMD_QP_ATTR, + + IB_USER_CM_CMD_EVENT, +}; +/* + * command ABI structures. 
+ */ +struct ib_ucm_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ib_ucm_create_id { + __u64 response; +}; + +struct ib_ucm_create_id_resp { + __u32 id; +}; + +struct ib_ucm_destroy_id { + __u32 id; +}; + +struct ib_ucm_attr_id { + __u64 response; + __u32 id; +}; + +struct ib_ucm_attr_id_resp { + __u64 service_id; + __u64 service_mask; + __u32 state; + __u32 lap_state; + __u32 local_id; + __u32 remote_id; +}; + +struct ib_ucm_listen { + __u64 service_id; + __u64 service_mask; + __u32 id; +}; + +struct ib_ucm_establish { + __u32 id; +}; + +struct ib_ucm_private_data { + __u64 data; + __u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __u16 dlid; + __u16 slid; + __u32 raw_traffic; + __u32 flow_label; + __u32 reversible; + __u32 mtu; + __u16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +struct ib_ucm_req { + __u32 id; + __u32 qpn; + __u32 qp_type; + __u32 psn; + __u64 sid; + + __u64 primary_path; + __u64 alternate_path; + __u8 len; + __u8 peer_to_peer; + __u8 responder_resources; + __u8 initiator_depth; + __u8 remote_cm_response_timeout; + __u8 flow_control; + __u8 local_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 max_cm_retries; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rep { + __u64 data; + __u32 id; + __u32 qpn; + __u32 psn; + __u8 len; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_info { + __u32 id; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; + +struct ib_ucm_mra { + __u64 data; + __u32 id; + __u8 len; + __u8 timeout; + __u8 reserved[2]; +}; + +struct ib_ucm_lap { + __u64 path; + __u64 data; + 
__u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_sidr_req { + __u32 id; + __u32 timeout; + __u64 sid; + __u64 data; + __u64 path; + __u16 pkey; + __u8 len; + __u8 max_cm_retries; +}; + +struct ib_ucm_sidr_rep { + __u32 id; + __u32 qpn; + __u32 qkey; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; +/* + * event notification ABI structures. + */ +struct ib_ucm_event_get { + __u64 response; + __u64 data; + __u64 info; + __u8 data_len; + __u8 info_len; + __u8 reserved[2]; +}; + +struct ib_ucm_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + struct ib_ucm_path_rec primary_path; + struct ib_ucm_path_rec alternate_path; + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 qp_type; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 local_cm_response_timeout; + __u8 flow_control; + __u8 remote_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_rep_event_resp { + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rej_event_resp { + __u32 reason; + /* ari in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_mra_event_resp { + __u8 timeout; + __u8 reserved[3]; +}; + +struct ib_ucm_lap_event_resp { + struct ib_ucm_path_rec path; +}; + +struct ib_ucm_apr_event_resp { + __u32 status; + /* apr info in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_sidr_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + __u16 pkey; + __u8 reserved[2]; +}; + +struct ib_ucm_sidr_rep_event_resp { + __u32 status; + __u32 qkey; + __u32 qpn; + /* info in ib_ucm_event_get info field. 
*/ +}; + +struct ib_ucm_event_resp { + __u32 id; + __u32 state; + __u32 event; + union { + struct ib_ucm_req_event_resp req_resp; + struct ib_ucm_rep_event_resp rep_resp; + struct ib_ucm_rej_event_resp rej_resp; + struct ib_ucm_mra_event_resp mra_resp; + struct ib_ucm_lap_event_resp lap_resp; + struct ib_ucm_apr_event_resp apr_resp; + + struct ib_ucm_sidr_req_event_resp sidr_req_resp; + struct ib_ucm_sidr_rep_event_resp sidr_rep_resp; + + __u32 send_status; + } u; +}; + +#endif /* IB_USER_CM_H */ Index: infiniband/core/ucm.c =================================================================== --- infiniband/core/ucm.c (revision 0) +++ infiniband/core/ucm.c (revision 0) @@ -0,0 +1,1388 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ucm.h" + +MODULE_AUTHOR("Libor Michalek"); +MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UCM_MAJOR = 231, + IB_UCM_MINOR = 255 +}; + +#define IB_UCM_DEV MKDEV(IB_UCM_MAJOR, IB_UCM_MINOR) + +static struct semaphore ctx_id_mutex; +static struct idr ctx_id_table; +static int ctx_id_rover = 0; + +static struct ib_ucm_context *ib_ucm_ctx_get(int id) +{ + struct ib_ucm_context *ctx; + + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, id); + if (ctx) + ctx->ref++; + up(&ctx_id_mutex); + + return ctx; +} + +static void ib_ucm_ctx_put(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; + + down(&ctx_id_mutex); + + ctx->ref--; + if (!ctx->ref) + idr_remove(&ctx_id_table, ctx->id); + + up(&ctx_id_mutex); + + if (ctx->ref) + return; + + down(&ctx->file->mutex); + + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, + struct ib_ucm_event, ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + kfree(uevent); + } + + up(&ctx->file->mutex); + + printk(KERN_ERR "UCM: Destroyed CM ID <%d>\n", ctx->id); + + (void)ib_destroy_cm_id(ctx->cm_id); + kfree(ctx); +} + +static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) +{ + struct ib_ucm_context *ctx; + int result; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->ref = 1; /* user reference */ + ctx->file = file; + + INIT_LIST_HEAD(&ctx->events); + init_MUTEX(&ctx->mutex); + + list_add_tail(&ctx->file_list, 
&file->ctxs); + + ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; +retry: + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); + up(&ctx_id_mutex); + + if (result == -EAGAIN) + goto retry; + if (result) + goto error; + + printk(KERN_ERR "UCM: Allocated CM ID <%d>\n", ctx->id); + + return ctx; +error: + list_del(&ctx->file_list); + kfree(ctx); + + return NULL; +} +/* + * Event portion of the API, handle CM events + * and allow event polling. + */ +static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, + struct ib_sa_path_rec *kpath) +{ + memcpy(upath->dgid, kpath->dgid.raw, sizeof(union ib_gid)); + memcpy(upath->sgid, kpath->sgid.raw, sizeof(union ib_gid)); + + upath->dlid = kpath->dlid; + upath->slid = kpath->slid; + upath->raw_traffic = kpath->raw_traffic; + upath->flow_label = kpath->flow_label; + upath->hop_limit = kpath->hop_limit; + upath->traffic_class = kpath->traffic_class; + upath->reversible = kpath->reversible; + upath->numb_path = kpath->numb_path; + upath->pkey = kpath->pkey; + upath->sl = kpath->sl; + upath->mtu_selector = kpath->mtu_selector; + upath->mtu = kpath->mtu; + upath->rate_selector = kpath->rate_selector; + upath->rate = kpath->rate; + upath->packet_life_time = kpath->packet_life_time; + upath->preference = kpath->preference; + + upath->packet_life_time_selector = + kpath->packet_life_time_selector; +} + +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, + struct ib_cm_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + + ureq->remote_ca_guid = kreq->remote_ca_guid; + ureq->remote_qkey = kreq->remote_qkey; + ureq->remote_qpn = kreq->remote_qpn; + ureq->qp_type = kreq->qp_type; + ureq->starting_psn = kreq->starting_psn; + ureq->responder_resources = kreq->responder_resources; + ureq->initiator_depth = kreq->initiator_depth; + ureq->local_cm_response_timeout = 
kreq->local_cm_response_timeout; + ureq->flow_control = kreq->flow_control; + ureq->remote_cm_response_timeout = kreq->remote_cm_response_timeout; + ureq->retry_count = kreq->retry_count; + ureq->rnr_retry_count = kreq->rnr_retry_count; + ureq->srq = kreq->srq; + + ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); + ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); +} + +static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, + struct ib_cm_rep_event_param *krep) +{ + urep->remote_ca_guid = krep->remote_ca_guid; + urep->remote_qkey = krep->remote_qkey; + urep->remote_qpn = krep->remote_qpn; + urep->starting_psn = krep->starting_psn; + urep->responder_resources = krep->responder_resources; + urep->initiator_depth = krep->initiator_depth; + urep->target_ack_delay = krep->target_ack_delay; + urep->failover_accepted = krep->failover_accepted; + urep->flow_control = krep->flow_control; + urep->rnr_retry_count = krep->rnr_retry_count; + urep->srq = krep->srq; +} + +static void ib_ucm_event_rej_get(struct ib_ucm_rej_event_resp *urej, + struct ib_cm_rej_event_param *krej) +{ + urej->reason = krej->reason; +} + +static void ib_ucm_event_mra_get(struct ib_ucm_mra_event_resp *umra, + struct ib_cm_mra_event_param *kmra) +{ + umra->timeout = kmra->service_timeout; +} + +static void ib_ucm_event_lap_get(struct ib_ucm_lap_event_resp *ulap, + struct ib_cm_lap_event_param *klap) +{ + ib_ucm_event_path_get(&ulap->path, klap->alternate_path); +} + +static void ib_ucm_event_apr_get(struct ib_ucm_apr_event_resp *uapr, + struct ib_cm_apr_event_param *kapr) +{ + uapr->status = kapr->ap_status; +} + +static void ib_ucm_event_sidr_req_get(struct ib_ucm_sidr_req_event_resp *ureq, + struct ib_cm_sidr_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + ureq->pkey = kreq->pkey; +} + +static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, + struct ib_cm_sidr_rep_event_param *krep) +{ + urep->status = 
krep->status; + urep->qkey = krep->qkey; + urep->qpn = krep->qpn; +}; + +static int ib_ucm_event_process(struct ib_cm_event *evt, + struct ib_ucm_event *uvt) +{ + void *info = NULL; + int result; + + switch (evt->event) { + case IB_CM_REQ_RECEIVED: + ib_ucm_event_req_get(&uvt->resp.u.req_resp, + &evt->param.req_rcvd); + uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REP_RECEIVED: + ib_ucm_event_rep_get(&uvt->resp.u.rep_resp, + &evt->param.rep_rcvd); + uvt->data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_RTU_RECEIVED: + uvt->data_len = IB_CM_RTU_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREQ_RECEIVED: + uvt->data_len = IB_CM_DREQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREP_RECEIVED: + uvt->data_len = IB_CM_DREP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_MRA_RECEIVED: + ib_ucm_event_mra_get(&uvt->resp.u.mra_resp, + &evt->param.mra_rcvd); + uvt->data_len = IB_CM_MRA_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REJ_RECEIVED: + ib_ucm_event_rej_get(&uvt->resp.u.rej_resp, + &evt->param.rej_rcvd); + uvt->data_len = IB_CM_REJ_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.rej_rcvd.ari_length; + info = evt->param.rej_rcvd.ari; + + break; + case IB_CM_LAP_RECEIVED: + ib_ucm_event_lap_get(&uvt->resp.u.lap_resp, + &evt->param.lap_rcvd); + uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_APR_RECEIVED: + ib_ucm_event_apr_get(&uvt->resp.u.apr_resp, + &evt->param.apr_rcvd); + uvt->data_len = IB_CM_APR_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.apr_rcvd.info_len; + info = evt->param.apr_rcvd.apr_info; + + break; + case IB_CM_SIDR_REQ_RECEIVED: + ib_ucm_event_sidr_req_get(&uvt->resp.u.sidr_req_resp, + &evt->param.sidr_req_rcvd); + uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_SIDR_REP_RECEIVED: + ib_ucm_event_sidr_rep_get(&uvt->resp.u.sidr_rep_resp, + &evt->param.sidr_rep_rcvd); + uvt->data_len = IB_CM_SIDR_REP_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.sidr_rep_rcvd.info_len; + info 
= evt->param.sidr_rep_rcvd.info; + + break; + default: + uvt->resp.u.send_status = evt->param.send_status; + + break; + } + + if (uvt->data_len && evt->private_data) { + + uvt->data = kmalloc(uvt->data_len, GFP_KERNEL); + if (!uvt->data) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->data, evt->private_data, uvt->data_len); + } + + if (uvt->info_len && info) { + + uvt->info = kmalloc(uvt->info_len, GFP_KERNEL); + if (!uvt->info) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->info, info, uvt->info_len); + } + + return 0; +error: + if (uvt->info) + kfree(uvt->info); + if (uvt->data) + kfree(uvt->data); + return result; +} + +static int ib_ucm_event_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event) +{ + struct ib_ucm_event *uevent; + struct ib_ucm_context *ctx; + int result = 0; + int id; + + /* + * lookup correct context based on event type. + */ + switch (event->event) { + case IB_CM_REQ_RECEIVED: + id = (int)event->param.req_rcvd.listen_id->context; + break; + case IB_CM_SIDR_REQ_RECEIVED: + id = (int)event->param.sidr_req_rcvd.listen_id->context; + break; + default: + id = (int)cm_id->context; + break; + } + + ctx = ib_ucm_ctx_get(id); + if (!ctx) + return -ENOENT; + + if (event->event == IB_CM_REQ_RECEIVED || + event->event == IB_CM_SIDR_REQ_RECEIVED) + id = IB_UCM_CM_ID_INVALID; + + uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) { + result = -ENOMEM; + goto done; + } + + memset(uevent, 0, sizeof(*uevent)); + + uevent->resp.id = id; + uevent->resp.event = event->event; + uevent->resp.state = cm_id->state; + + result = ib_ucm_event_process(event, uevent); + if (result) + goto done; + + uevent->ctx = ctx; + + down(&ctx->file->mutex); + + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + + wake_up_interruptible(&ctx->file->poll_wait); + + up(&ctx->file->mutex); +done: + ctx->error = result; + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static 
ssize_t ib_ucm_qp_event(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_event_get cmd; + struct ib_ucm_event *uevent = NULL; + int result = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct ib_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + /* + * wait + */ + down(&file->mutex); + + while (list_empty(&file->events)) { + + if (file->filp->f_flags & O_NONBLOCK) { + result = -EAGAIN; + break; + } + + if (signal_pending(current)) { + result = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + + up(&file->mutex); + schedule(); + down(&file->mutex); + + finish_wait(&file->poll_wait, &wait); + } + + if (result) + goto done; + + uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); + + if (uevent->resp.id != IB_UCM_CM_ID_INVALID) + goto user; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } + + uevent->resp.id = ctx->id; + +user: + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + result = -EFAULT; + goto done; + } + + if (uevent->data) { + + if (cmd.data_len < uevent->data_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.data, + uevent->data, cmd.data_len)) { + result = -EFAULT; + goto done; + } + } + + if (uevent->info) { + + if (cmd.info_len < uevent->info_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.info, + uevent->info, cmd.info_len)) { + result = -EFAULT; + goto done; + } + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + if (uevent->data) + kfree(uevent->data); + if (uevent->info) + kfree(uevent->info); + kfree(uevent); +done: + up(&file->mutex); + return result; +} + + +static ssize_t ib_ucm_create_id(struct ib_ucm_file *file, + const char __user 
*inbuf, + int in_len, int out_len) +{ + struct ib_ucm_create_id cmd; + struct ib_ucm_create_id_resp resp; + struct ib_ucm_context *ctx; + int result; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) + return -ENOMEM; + + ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, + (void *)(unsigned long)ctx->id); + if (!ctx->cm_id) { + result = -ENOMEM; + goto err_cm; + } + + resp.id = ctx->id; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + result = -EFAULT; + goto err_ret; + } + + return 0; +err_ret: + (void)ib_destroy_cm_id(ctx->cm_id); +err_cm: + ib_ucm_ctx_put(ctx); /* user reference */ + + return result; +} + +static ssize_t ib_ucm_destroy_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_destroy_id cmd; + struct ib_ucm_context *ctx; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + ib_ucm_ctx_put(ctx); /* user reference */ + ib_ucm_ctx_put(ctx); /* func reference */ + + return 0; +} + +static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_attr_id_resp resp; + struct ib_ucm_attr_id cmd; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) { + result = -EINVAL; + goto done; + } + + resp.service_id = ctx->cm_id->service_id; + resp.service_mask = ctx->cm_id->service_mask; + resp.state = ctx->cm_id->state; + resp.lap_state = ctx->cm_id->lap_state; + resp.local_id = ctx->cm_id->local_id; + resp.remote_id = ctx->cm_id->remote_id; + + if (copy_to_user((void __user *)(unsigned 
long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + +done: + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_listen(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_listen cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_listen(ctx->cm_id, cmd.service_id, + cmd.service_mask); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_establish(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_establish cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_establish(ctx->cm_id); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static int ib_ucm_alloc_data(void **dest, u64 src, u32 len) +{ + void *data; + + *dest = NULL; + + if (!len) + return 0; + + data = kmalloc(len, GFP_KERNEL); + if (!data) + return -ENOMEM; + + if (copy_from_user(data, (void __user *)(unsigned long)src, len)) { + kfree(data); + return -EFAULT; + } + + *dest = data; + return 0; +} + +static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src) +{ + struct ib_ucm_path_rec ucm_path; + struct ib_sa_path_rec *sa_path; + + *path = NULL; + + if (!src) + return 0; + + sa_path = kmalloc(sizeof(*sa_path), GFP_KERNEL); + if (!sa_path) + return -ENOMEM; + + if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src, + sizeof(ucm_path))) { + + kfree(sa_path); + return -EFAULT; 
+	}
+
+	memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof(union ib_gid));
+	memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof(union ib_gid));
+
+	sa_path->dlid = ucm_path.dlid;
+	sa_path->slid = ucm_path.slid;
+	sa_path->raw_traffic = ucm_path.raw_traffic;
+	sa_path->flow_label = ucm_path.flow_label;
+	sa_path->hop_limit = ucm_path.hop_limit;
+	sa_path->traffic_class = ucm_path.traffic_class;
+	sa_path->reversible = ucm_path.reversible;
+	sa_path->numb_path = ucm_path.numb_path;
+	sa_path->pkey = ucm_path.pkey;
+	sa_path->sl = ucm_path.sl;
+	sa_path->mtu_selector = ucm_path.mtu_selector;
+	sa_path->mtu = ucm_path.mtu;
+	sa_path->rate_selector = ucm_path.rate_selector;
+	sa_path->rate = ucm_path.rate;
+	sa_path->packet_life_time = ucm_path.packet_life_time;
+	sa_path->preference = ucm_path.preference;
+
+	sa_path->packet_life_time_selector =
+		ucm_path.packet_life_time_selector;
+
+	*path = sa_path;
+	return 0;
+}
+
+static ssize_t ib_ucm_send_req(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	struct ib_cm_req_param param;
+	struct ib_ucm_context *ctx;
+	struct ib_ucm_req cmd;
+	int result;
+
+	param.private_data = NULL;
+	param.primary_path = NULL;
+	param.alternate_path = NULL;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len);
+	if (result)
+		goto done;
+
+	result = ib_ucm_path_get(&param.primary_path, cmd.primary_path);
+	if (result)
+		goto done;
+
+	result = ib_ucm_path_get(&param.alternate_path, cmd.alternate_path);
+	if (result)
+		goto done;
+
+	param.private_data_len = cmd.len;
+	param.service_id = cmd.sid;
+	param.qp_num = cmd.qpn;
+	param.qp_type = cmd.qp_type;
+	param.starting_psn = cmd.psn;
+	param.peer_to_peer = cmd.peer_to_peer;
+	param.responder_resources = cmd.responder_resources;
+	param.initiator_depth = cmd.initiator_depth;
+	param.remote_cm_response_timeout = cmd.remote_cm_response_timeout;
+	param.flow_control = cmd.flow_control;
+	param.local_cm_response_timeout = cmd.local_cm_response_timeout;
+	param.retry_count = cmd.retry_count;
+	param.rnr_retry_count = cmd.rnr_retry_count;
+	param.max_cm_retries = cmd.max_cm_retries;
+	param.srq = cmd.srq;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = ib_send_cm_req(ctx->cm_id, &param);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (param.private_data)
+		kfree(param.private_data);
+	if (param.primary_path)
+		kfree(param.primary_path);
+	if (param.alternate_path)
+		kfree(param.alternate_path);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_rep(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	struct ib_cm_rep_param param;
+	struct ib_ucm_context *ctx;
+	struct ib_ucm_rep cmd;
+	int result;
+
+	param.private_data = NULL;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len);
+	if (result)
+		return result;
+
+	param.qp_num = cmd.qpn;
+	param.starting_psn = cmd.psn;
+	param.private_data_len = cmd.len;
+	param.responder_resources = cmd.responder_resources;
+	param.initiator_depth = cmd.initiator_depth;
+	param.target_ack_delay = cmd.target_ack_delay;
+	param.failover_accepted = cmd.failover_accepted;
+	param.flow_control = cmd.flow_control;
+	param.rnr_retry_count = cmd.rnr_retry_count;
+	param.srq = cmd.srq;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = ib_send_cm_rep(ctx->cm_id, &param);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (param.private_data)
+		kfree(param.private_data);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_private_data(struct ib_ucm_file *file,
+					const char __user *inbuf, int in_len,
+					int (*func)(struct ib_cm_id *cm_id,
+						    void *private_data,
+						    u8 private_data_len))
+{
+	struct ib_ucm_private_data cmd;
+	struct ib_ucm_context *ctx;
+	void *private_data = NULL;
+	int result;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&private_data, cmd.data, cmd.len);
+	if (result)
+		return result;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = func(ctx->cm_id, private_data, cmd.len);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (private_data)
+		kfree(private_data);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_rtu(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_rtu);
+}
+
+static ssize_t ib_ucm_send_dreq(struct ib_ucm_file *file,
+				const char __user *inbuf,
+				int in_len, int out_len)
+{
+	return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_dreq);
+}
+
+static ssize_t ib_ucm_send_drep(struct ib_ucm_file *file,
+				const char __user *inbuf,
+				int in_len, int out_len)
+{
+	return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_drep);
+}
+
+static ssize_t ib_ucm_send_info(struct ib_ucm_file *file,
+				const char __user *inbuf, int in_len,
+				int (*func)(struct ib_cm_id *cm_id,
+					    int status,
+					    void *info,
+					    u8 info_len,
+					    void *data,
+					    u8 data_len))
+{
+	struct ib_ucm_context *ctx;
+	struct ib_ucm_info cmd;
+	void *data = NULL;
+	void *info = NULL;
+	int result;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&data, cmd.data, cmd.data_len);
+	if (result)
+		goto done;
+
+	result = ib_ucm_alloc_data(&info, cmd.info, cmd.info_len);
+	if (result)
+		goto done;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = func(ctx->cm_id, cmd.status,
+			      info, cmd.info_len,
+			      data, cmd.data_len);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (data)
+		kfree(data);
+	if (info)
+		kfree(info);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_rej(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_rej);
+}
+
+static ssize_t ib_ucm_send_apr(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_apr);
+}
+
+static ssize_t ib_ucm_send_mra(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	struct ib_ucm_context *ctx;
+	struct ib_ucm_mra cmd;
+	void *data = NULL;
+	int result;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&data, cmd.data, cmd.len);
+	if (result)
+		return result;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = ib_send_cm_mra(ctx->cm_id, cmd.timeout,
+					data, cmd.len);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (data)
+		kfree(data);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_lap(struct ib_ucm_file *file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	struct ib_ucm_context *ctx;
+	struct ib_sa_path_rec *path = NULL;
+	struct ib_ucm_lap cmd;
+	void *data = NULL;
+	int result;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&data, cmd.data, cmd.len);
+	if (result)
+		goto done;
+
+	result = ib_ucm_path_get(&path, cmd.path);
+	if (result)
+		goto done;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = ib_send_cm_lap(ctx->cm_id, path, data, cmd.len);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (data)
+		kfree(data);
+	if (path)
+		kfree(path);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_sidr_req(struct ib_ucm_file *file,
+				    const char __user *inbuf,
+				    int in_len, int out_len)
+{
+	struct ib_cm_sidr_req_param param;
+	struct ib_ucm_context *ctx;
+	struct ib_ucm_sidr_req cmd;
+	int result;
+
+	param.private_data = NULL;
+	param.path = NULL;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len);
+	if (result)
+		goto done;
+
+	result = ib_ucm_path_get(&param.path, cmd.path);
+	if (result)
+		goto done;
+
+	param.private_data_len = cmd.len;
+	param.service_id = cmd.sid;
+	param.timeout_ms = cmd.timeout;
+	param.max_cm_retries = cmd.max_cm_retries;
+	param.pkey = cmd.pkey;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = ib_send_cm_sidr_req(ctx->cm_id, &param);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (param.private_data)
+		kfree(param.private_data);
+	if (param.path)
+		kfree(param.path);
+
+	return result;
+}
+
+static ssize_t ib_ucm_send_sidr_rep(struct ib_ucm_file *file,
+				    const char __user *inbuf,
+				    int in_len, int out_len)
+{
+	struct ib_cm_sidr_rep_param param;
+	struct ib_ucm_sidr_rep cmd;
+	struct ib_ucm_context *ctx;
+	int result;
+
+	param.info = NULL;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	result = ib_ucm_alloc_data(&param.private_data,
+				   cmd.data, cmd.data_len);
+	if (result)
+		goto done;
+
+	result = ib_ucm_alloc_data(&param.info, cmd.info, cmd.info_len);
+	if (result)
+		goto done;
+
+	param.qp_num = cmd.qpn;
+	param.qkey = cmd.qkey;
+	param.status = cmd.status;
+	param.info_length = cmd.info_len;
+	param.private_data_len = cmd.data_len;
+
+	ctx = ib_ucm_ctx_get(cmd.id);
+	if (!ctx) {
+		result = -ENOENT;
+		goto done;
+	}
+
+	down(&ctx->file->mutex);
+	if (ctx->file != file)
+		result = -EINVAL;
+	else
+		result = ib_send_cm_sidr_rep(ctx->cm_id, &param);
+
+	up(&ctx->file->mutex);
+	ib_ucm_ctx_put(ctx); /* func reference */
+done:
+	if (param.private_data)
+		kfree(param.private_data);
+	if (param.info)
+		kfree(param.info);
+
+	return result;
+}
+
+static ssize_t ib_ucm_qp_attr(struct ib_ucm_file *file,
+			      const char __user *inbuf,
+			      int in_len, int out_len)
+{
+	return 0;
+}
+
+static ssize_t (*ucm_cmd_table[])(struct ib_ucm_file *file,
+				  const char __user *inbuf,
+				  int in_len, int out_len) = {
+	[IB_USER_CM_CMD_CREATE_ID] = ib_ucm_create_id,
+	[IB_USER_CM_CMD_DESTORY_ID] = ib_ucm_destroy_id,
+	[IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id,
+	[IB_USER_CM_CMD_LISTEN] = ib_ucm_listen,
+	[IB_USER_CM_CMD_ESTABLISH] = ib_ucm_establish,
+	[IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req,
+	[IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep,
+	[IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu,
+	[IB_USER_CM_CMD_SEND_DREQ] = ib_ucm_send_dreq,
+	[IB_USER_CM_CMD_SEND_DREP] = ib_ucm_send_drep,
+	[IB_USER_CM_CMD_SEND_REJ] = ib_ucm_send_rej,
+	[IB_USER_CM_CMD_SEND_MRA] = ib_ucm_send_mra,
+	[IB_USER_CM_CMD_SEND_LAP] = ib_ucm_send_lap,
+	[IB_USER_CM_CMD_SEND_APR] = ib_ucm_send_apr,
+	[IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req,
+	[IB_USER_CM_CMD_SEND_SIDR_REP] = ib_ucm_send_sidr_rep,
+	[IB_USER_CM_CMD_QP_ATTR] = ib_ucm_qp_attr,
+	[IB_USER_CM_CMD_EVENT] = ib_ucm_qp_event,
+};
+
+static ssize_t ib_ucm_write(struct file *filp, const char __user *buf,
+			    size_t len, loff_t *pos)
+{
+	struct ib_ucm_file *file = filp->private_data;
+	struct ib_ucm_cmd_hdr hdr;
+	ssize_t result;
+
+	if (len < sizeof(hdr))
+		return -EINVAL;
+
+	if (copy_from_user(&hdr, buf, sizeof(hdr)))
+		return -EFAULT;
+
+	printk(KERN_ERR "UCM: Write.
cmd <%d> in <%d> out <%d> len <%d>\n", + hdr.cmd, hdr.in, hdr.out, len); + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + result = ucm_cmd_table[hdr.cmd](file, buf + sizeof(hdr), + hdr.in, hdr.out); + if (!result) + result = len; + + return result; +} + +static unsigned int ib_ucm_poll(struct file *filp, + struct poll_table_struct *wait) +{ + struct ib_ucm_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_ucm_open(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file; + + file = kmalloc(sizeof(*file), GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + + init_MUTEX(&file->mutex); + + filp->private_data = file; + file->filp = filp; + + printk(KERN_ERR "UCM: Created struct\n"); + + return 0; +} + +static int ib_ucm_close(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_context *ctx; + + down(&file->mutex); + + while (!list_empty(&file->ctxs)) { + + ctx = list_entry(file->ctxs.next, + struct ib_ucm_context, file_list); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* user reference */ + down(&file->mutex); + } + + up(&file->mutex); + + kfree(file); + + printk(KERN_ERR "UCM: Deleted struct\n"); + return 0; +} + +static struct file_operations ib_ucm_fops = { + .owner = THIS_MODULE, + .open = ib_ucm_open, + .release = ib_ucm_close, + .write = ib_ucm_write, + .poll = ib_ucm_poll, +}; + + +static struct class_simple *ib_ucm_class; +static struct cdev ib_ucm_cdev; + +static int __init ib_ucm_init(void) +{ + int result; + + result = register_chrdev_region(IB_UCM_DEV, 1, "infiniband_cm"); + if (result) { + printk(KERN_ERR "UCM: Error <%d> registering dev\n", 
result); + goto err_chr; + } + + cdev_init(&ib_ucm_cdev, &ib_ucm_fops); + + result = cdev_add(&ib_ucm_cdev, IB_UCM_DEV, 1); + if (result) { + printk(KERN_ERR "UCM: Error <%d> adding cdev\n", result); + goto err_cdev; + } + + ib_ucm_class = class_simple_create(THIS_MODULE, "ucm"); + if (IS_ERR(ib_ucm_class)) { + result = PTR_ERR(ib_ucm_class); + printk(KERN_ERR "UCM: Error <%d> creating class\n", result); + goto err_class; + } + + class_simple_device_add(ib_ucm_class, + IB_UCM_DEV, + NULL, + "ucm"); + + devfs_mk_cdev(IB_UCM_DEV, + S_IFCHR|S_IRUGO|S_IWUGO, + "infiniband/ucm"); + + idr_init(&ctx_id_table); + init_MUTEX(&ctx_id_mutex); + + return 0; +err_class: + cdev_del(&ib_ucm_cdev); +err_cdev: + unregister_chrdev_region(IB_UCM_DEV, 1); +err_chr: + return result; +} + +static void __exit ib_ucm_cleanup(void) +{ + devfs_remove("infiniband/ucm"); + class_simple_device_remove(IB_UCM_DEV); + class_simple_destroy(ib_ucm_class); + cdev_del(&ib_ucm_cdev); + unregister_chrdev_region(IB_UCM_DEV, 1); +} + +module_init(ib_ucm_init); +module_exit(ib_ucm_cleanup); Index: infiniband/core/ucm.h =================================================================== --- infiniband/core/ucm.h (revision 0) +++ infiniband/core/ucm.h (revision 0) @@ -0,0 +1,84 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef UCM_H +#define UCM_H + +#include +#include +#include +#include + +#include +#include + +#define IB_UCM_CM_ID_INVALID 0xffffffff + +struct ib_ucm_file { + struct semaphore mutex; + struct file *filp; + /* + * list of pending events + */ + struct list_head ctxs; /* list of active connections */ + struct list_head events; /* list of pending events */ + wait_queue_head_t poll_wait; +}; + +struct ib_ucm_context { + int id; + int ref; + int error; + + struct ib_ucm_file *file; + struct ib_cm_id *cm_id; + struct semaphore mutex; + + struct list_head events; /* list of pending events. 
*/
+	struct list_head file_list; /* member in file ctx list */
+};
+
+struct ib_ucm_event {
+	struct ib_ucm_context *ctx;
+	struct list_head file_list; /* member in file event list */
+	struct list_head ctx_list; /* member in ctx event list */
+
+	struct ib_ucm_event_resp resp;
+	void *data;
+	void *info;
+	int data_len;
+	int info_len;
+};
+
+#endif /* UCM_H */

From libor at topspin.com  Fri Mar 11 15:17:48 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 11 Mar 2005 15:17:48 -0800
Subject: [openib-general] Re: [PATCH] [TRIVIAL] SDP: Eliminate uneeded initialization and fix some typos
In-Reply-To: <1110364516.4645.22.camel@localhost.localdomain>; from halr@voltaire.com on Wed, Mar 09, 2005 at 05:35:17AM -0500
References: <1110364516.4645.22.camel@localhost.localdomain>
Message-ID: <20050311151748.B31689@topspin.com>

On Wed, Mar 09, 2005 at 05:35:17AM -0500, Hal Rosenstock wrote:
> SDP: Eliminate uneeded initialization and fix some typos

Thanks, applied and committed.

-Libor

From libor at topspin.com  Fri Mar 11 15:22:13 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 11 Mar 2005 15:22:13 -0800
Subject: [openib-general] [RFC] Userspace CM access.
In-Reply-To: <20050311150946.A31689@topspin.com>; from libor@topspin.com on Fri, Mar 11, 2005 at 03:09:46PM -0800
References: <20050311150946.A31689@topspin.com>
Message-ID: <20050311152213.C31689@topspin.com>

On Fri, Mar 11, 2005 at 03:09:46PM -0800, Libor Michalek wrote:
>
> Below is the source for the kernel portion of the userspace CM. I've
> got enough of the userspace library to verify the basic functionality,
> but it's not yet ready for general use. However, I wanted to get the
> kernel portion posted for comment and checked-in now that the bulk of
> it is complete. The code is for the most part a pass through from
> userspace to the kernel CM, plus synchronization, sanity checking,
> and the event model is turned into a "get next event" interface.

OK. Not sure how one of the structure fields disappeared, but here's
a resend that actually builds.

-Libor

Signed-off-by: Libor Michalek

Index: infiniband/core/Makefile
===================================================================
--- infiniband/core/Makefile (revision 1979)
+++ infiniband/core/Makefile (working copy)
@@ -1,6 +1,7 @@
 EXTRA_CFLAGS += -Idrivers/infiniband/include

-obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o \
+	ib_ucm.o

 ib_core-y := packer.o ud_header.o verbs.o sysfs.o \
		device.o fmr_pool.o cache.o
@@ -12,3 +13,5 @@
 ib_sa-y := sa_query.o

 ib_umad-y := user_mad.o
+
+ib_ucm-y := ucm.o
Index: infiniband/include/ib_user_cm.h
===================================================================
--- infiniband/include/ib_user_cm.h (revision 0)
+++ infiniband/include/ib_user_cm.h (revision 0)
@@ -0,0 +1,326 @@
+/*
+ * Copyright (c) 2005 Topspin Communications. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_verbs.h 1852 2005-02-21 22:21:01Z roland $ + */ + +#ifndef IB_USER_CM_H +#define IB_USER_CM_H + +#include + +#define IB_USER_CM_ABI_VERSION 1 + +enum { + IB_USER_CM_CMD_CREATE_ID, + IB_USER_CM_CMD_DESTORY_ID, + IB_USER_CM_CMD_ATTR_ID, + + IB_USER_CM_CMD_LISTEN, + IB_USER_CM_CMD_ESTABLISH, + + IB_USER_CM_CMD_SEND_REQ, + IB_USER_CM_CMD_SEND_REP, + IB_USER_CM_CMD_SEND_RTU, + IB_USER_CM_CMD_SEND_DREQ, + IB_USER_CM_CMD_SEND_DREP, + IB_USER_CM_CMD_SEND_REJ, + IB_USER_CM_CMD_SEND_MRA, + IB_USER_CM_CMD_SEND_LAP, + IB_USER_CM_CMD_SEND_APR, + IB_USER_CM_CMD_SEND_SIDR_REQ, + IB_USER_CM_CMD_SEND_SIDR_REP, + IB_USER_CM_CMD_QP_ATTR, + + IB_USER_CM_CMD_EVENT, +}; +/* + * command ABI structures. 
+ */ +struct ib_ucm_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ib_ucm_create_id { + __u64 response; +}; + +struct ib_ucm_create_id_resp { + __u32 id; +}; + +struct ib_ucm_destroy_id { + __u32 id; +}; + +struct ib_ucm_attr_id { + __u64 response; + __u32 id; +}; + +struct ib_ucm_attr_id_resp { + __u64 service_id; + __u64 service_mask; + __u32 state; + __u32 lap_state; + __u32 local_id; + __u32 remote_id; +}; + +struct ib_ucm_listen { + __u64 service_id; + __u64 service_mask; + __u32 id; +}; + +struct ib_ucm_establish { + __u32 id; +}; + +struct ib_ucm_private_data { + __u64 data; + __u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __u16 dlid; + __u16 slid; + __u32 raw_traffic; + __u32 flow_label; + __u32 reversible; + __u32 mtu; + __u16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +struct ib_ucm_req { + __u32 id; + __u32 qpn; + __u32 qp_type; + __u32 psn; + __u64 sid; + __u64 data; + __u64 primary_path; + __u64 alternate_path; + __u8 len; + __u8 peer_to_peer; + __u8 responder_resources; + __u8 initiator_depth; + __u8 remote_cm_response_timeout; + __u8 flow_control; + __u8 local_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 max_cm_retries; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rep { + __u64 data; + __u32 id; + __u32 qpn; + __u32 psn; + __u8 len; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_info { + __u32 id; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; + +struct ib_ucm_mra { + __u64 data; + __u32 id; + __u8 len; + __u8 timeout; + __u8 reserved[2]; +}; + +struct ib_ucm_lap { + __u64 path; + 
__u64 data; + __u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_sidr_req { + __u32 id; + __u32 timeout; + __u64 sid; + __u64 data; + __u64 path; + __u16 pkey; + __u8 len; + __u8 max_cm_retries; +}; + +struct ib_ucm_sidr_rep { + __u32 id; + __u32 qpn; + __u32 qkey; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; +/* + * event notification ABI structures. + */ +struct ib_ucm_event_get { + __u64 response; + __u64 data; + __u64 info; + __u8 data_len; + __u8 info_len; + __u8 reserved[2]; +}; + +struct ib_ucm_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + struct ib_ucm_path_rec primary_path; + struct ib_ucm_path_rec alternate_path; + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 qp_type; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 local_cm_response_timeout; + __u8 flow_control; + __u8 remote_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_rep_event_resp { + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rej_event_resp { + __u32 reason; + /* ari in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_mra_event_resp { + __u8 timeout; + __u8 reserved[3]; +}; + +struct ib_ucm_lap_event_resp { + struct ib_ucm_path_rec path; +}; + +struct ib_ucm_apr_event_resp { + __u32 status; + /* apr info in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_sidr_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + __u16 pkey; + __u8 reserved[2]; +}; + +struct ib_ucm_sidr_rep_event_resp { + __u32 status; + __u32 qkey; + __u32 qpn; + /* info in ib_ucm_event_get info field. 
*/ +}; + +struct ib_ucm_event_resp { + __u32 id; + __u32 state; + __u32 event; + union { + struct ib_ucm_req_event_resp req_resp; + struct ib_ucm_rep_event_resp rep_resp; + struct ib_ucm_rej_event_resp rej_resp; + struct ib_ucm_mra_event_resp mra_resp; + struct ib_ucm_lap_event_resp lap_resp; + struct ib_ucm_apr_event_resp apr_resp; + + struct ib_ucm_sidr_req_event_resp sidr_req_resp; + struct ib_ucm_sidr_rep_event_resp sidr_rep_resp; + + __u32 send_status; + } u; +}; + +#endif /* IB_USER_CM_H */ Index: infiniband/core/ucm.c =================================================================== --- infiniband/core/ucm.c (revision 0) +++ infiniband/core/ucm.c (revision 0) @@ -0,0 +1,1388 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ucm.h" + +MODULE_AUTHOR("Libor Michalek"); +MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UCM_MAJOR = 231, + IB_UCM_MINOR = 255 +}; + +#define IB_UCM_DEV MKDEV(IB_UCM_MAJOR, IB_UCM_MINOR) + +static struct semaphore ctx_id_mutex; +static struct idr ctx_id_table; +static int ctx_id_rover = 0; + +static struct ib_ucm_context *ib_ucm_ctx_get(int id) +{ + struct ib_ucm_context *ctx; + + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, id); + if (ctx) + ctx->ref++; + up(&ctx_id_mutex); + + return ctx; +} + +static void ib_ucm_ctx_put(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; + + down(&ctx_id_mutex); + + ctx->ref--; + if (!ctx->ref) + idr_remove(&ctx_id_table, ctx->id); + + up(&ctx_id_mutex); + + if (ctx->ref) + return; + + down(&ctx->file->mutex); + + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, + struct ib_ucm_event, ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + kfree(uevent); + } + + up(&ctx->file->mutex); + + printk(KERN_ERR "UCM: Destroyed CM ID <%d>\n", ctx->id); + + (void)ib_destroy_cm_id(ctx->cm_id); + kfree(ctx); +} + +static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) +{ + struct ib_ucm_context *ctx; + int result; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->ref = 1; /* user reference */ + ctx->file = file; + + INIT_LIST_HEAD(&ctx->events); + init_MUTEX(&ctx->mutex); + + list_add_tail(&ctx->file_list, 
&file->ctxs); + + ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; +retry: + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); + up(&ctx_id_mutex); + + if (result == -EAGAIN) + goto retry; + if (result) + goto error; + + printk(KERN_ERR "UCM: Allocated CM ID <%d>\n", ctx->id); + + return ctx; +error: + list_del(&ctx->file_list); + kfree(ctx); + + return NULL; +} +/* + * Event portion of the API, handle CM events + * and allow event polling. + */ +static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, + struct ib_sa_path_rec *kpath) +{ + memcpy(upath->dgid, kpath->dgid.raw, sizeof(union ib_gid)); + memcpy(upath->sgid, kpath->sgid.raw, sizeof(union ib_gid)); + + upath->dlid = kpath->dlid; + upath->slid = kpath->slid; + upath->raw_traffic = kpath->raw_traffic; + upath->flow_label = kpath->flow_label; + upath->hop_limit = kpath->hop_limit; + upath->traffic_class = kpath->traffic_class; + upath->reversible = kpath->reversible; + upath->numb_path = kpath->numb_path; + upath->pkey = kpath->pkey; + upath->sl = kpath->sl; + upath->mtu_selector = kpath->mtu_selector; + upath->mtu = kpath->mtu; + upath->rate_selector = kpath->rate_selector; + upath->rate = kpath->rate; + upath->packet_life_time = kpath->packet_life_time; + upath->preference = kpath->preference; + + upath->packet_life_time_selector = + kpath->packet_life_time_selector; +} + +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, + struct ib_cm_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + + ureq->remote_ca_guid = kreq->remote_ca_guid; + ureq->remote_qkey = kreq->remote_qkey; + ureq->remote_qpn = kreq->remote_qpn; + ureq->qp_type = kreq->qp_type; + ureq->starting_psn = kreq->starting_psn; + ureq->responder_resources = kreq->responder_resources; + ureq->initiator_depth = kreq->initiator_depth; + ureq->local_cm_response_timeout = 
kreq->local_cm_response_timeout; + ureq->flow_control = kreq->flow_control; + ureq->remote_cm_response_timeout = kreq->remote_cm_response_timeout; + ureq->retry_count = kreq->retry_count; + ureq->rnr_retry_count = kreq->rnr_retry_count; + ureq->srq = kreq->srq; + + ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); + ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); +} + +static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, + struct ib_cm_rep_event_param *krep) +{ + urep->remote_ca_guid = krep->remote_ca_guid; + urep->remote_qkey = krep->remote_qkey; + urep->remote_qpn = krep->remote_qpn; + urep->starting_psn = krep->starting_psn; + urep->responder_resources = krep->responder_resources; + urep->initiator_depth = krep->initiator_depth; + urep->target_ack_delay = krep->target_ack_delay; + urep->failover_accepted = krep->failover_accepted; + urep->flow_control = krep->flow_control; + urep->rnr_retry_count = krep->rnr_retry_count; + urep->srq = krep->srq; +} + +static void ib_ucm_event_rej_get(struct ib_ucm_rej_event_resp *urej, + struct ib_cm_rej_event_param *krej) +{ + urej->reason = krej->reason; +} + +static void ib_ucm_event_mra_get(struct ib_ucm_mra_event_resp *umra, + struct ib_cm_mra_event_param *kmra) +{ + umra->timeout = kmra->service_timeout; +} + +static void ib_ucm_event_lap_get(struct ib_ucm_lap_event_resp *ulap, + struct ib_cm_lap_event_param *klap) +{ + ib_ucm_event_path_get(&ulap->path, klap->alternate_path); +} + +static void ib_ucm_event_apr_get(struct ib_ucm_apr_event_resp *uapr, + struct ib_cm_apr_event_param *kapr) +{ + uapr->status = kapr->ap_status; +} + +static void ib_ucm_event_sidr_req_get(struct ib_ucm_sidr_req_event_resp *ureq, + struct ib_cm_sidr_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + ureq->pkey = kreq->pkey; +} + +static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, + struct ib_cm_sidr_rep_event_param *krep) +{ + urep->status = 
krep->status; + urep->qkey = krep->qkey; + urep->qpn = krep->qpn; +}; + +static int ib_ucm_event_process(struct ib_cm_event *evt, + struct ib_ucm_event *uvt) +{ + void *info = NULL; + int result; + + switch (evt->event) { + case IB_CM_REQ_RECEIVED: + ib_ucm_event_req_get(&uvt->resp.u.req_resp, + &evt->param.req_rcvd); + uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REP_RECEIVED: + ib_ucm_event_rep_get(&uvt->resp.u.rep_resp, + &evt->param.rep_rcvd); + uvt->data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_RTU_RECEIVED: + uvt->data_len = IB_CM_RTU_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREQ_RECEIVED: + uvt->data_len = IB_CM_DREQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREP_RECEIVED: + uvt->data_len = IB_CM_DREP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_MRA_RECEIVED: + ib_ucm_event_mra_get(&uvt->resp.u.mra_resp, + &evt->param.mra_rcvd); + uvt->data_len = IB_CM_MRA_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REJ_RECEIVED: + ib_ucm_event_rej_get(&uvt->resp.u.rej_resp, + &evt->param.rej_rcvd); + uvt->data_len = IB_CM_REJ_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.rej_rcvd.ari_length; + info = evt->param.rej_rcvd.ari; + + break; + case IB_CM_LAP_RECEIVED: + ib_ucm_event_lap_get(&uvt->resp.u.lap_resp, + &evt->param.lap_rcvd); + uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_APR_RECEIVED: + ib_ucm_event_apr_get(&uvt->resp.u.apr_resp, + &evt->param.apr_rcvd); + uvt->data_len = IB_CM_APR_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.apr_rcvd.info_len; + info = evt->param.apr_rcvd.apr_info; + + break; + case IB_CM_SIDR_REQ_RECEIVED: + ib_ucm_event_sidr_req_get(&uvt->resp.u.sidr_req_resp, + &evt->param.sidr_req_rcvd); + uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_SIDR_REP_RECEIVED: + ib_ucm_event_sidr_rep_get(&uvt->resp.u.sidr_rep_resp, + &evt->param.sidr_rep_rcvd); + uvt->data_len = IB_CM_SIDR_REP_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.sidr_rep_rcvd.info_len; + info 
= evt->param.sidr_rep_rcvd.info; + + break; + default: + uvt->resp.u.send_status = evt->param.send_status; + + break; + } + + if (uvt->data_len && evt->private_data) { + + uvt->data = kmalloc(uvt->data_len, GFP_KERNEL); + if (!uvt->data) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->data, evt->private_data, uvt->data_len); + } + + if (uvt->info_len && info) { + + uvt->info = kmalloc(uvt->info_len, GFP_KERNEL); + if (!uvt->info) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->info, info, uvt->info_len); + } + + return 0; +error: + if (uvt->info) + kfree(uvt->info); + if (uvt->data) + kfree(uvt->data); + return result; +} + +static int ib_ucm_event_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event) +{ + struct ib_ucm_event *uevent; + struct ib_ucm_context *ctx; + int result = 0; + int id; + + /* + * lookup correct context based on event type. + */ + switch (event->event) { + case IB_CM_REQ_RECEIVED: + id = (int)event->param.req_rcvd.listen_id->context; + break; + case IB_CM_SIDR_REQ_RECEIVED: + id = (int)event->param.sidr_req_rcvd.listen_id->context; + break; + default: + id = (int)cm_id->context; + break; + } + + ctx = ib_ucm_ctx_get(id); + if (!ctx) + return -ENOENT; + + if (event->event == IB_CM_REQ_RECEIVED || + event->event == IB_CM_SIDR_REQ_RECEIVED) + id = IB_UCM_CM_ID_INVALID; + + uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) { + result = -ENOMEM; + goto done; + } + + memset(uevent, 0, sizeof(*uevent)); + + uevent->resp.id = id; + uevent->resp.event = event->event; + uevent->resp.state = cm_id->state; + + result = ib_ucm_event_process(event, uevent); + if (result) + goto done; + + uevent->ctx = ctx; + + down(&ctx->file->mutex); + + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + + wake_up_interruptible(&ctx->file->poll_wait); + + up(&ctx->file->mutex); +done: + ctx->error = result; + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static 
ssize_t ib_ucm_qp_event(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_event_get cmd; + struct ib_ucm_event *uevent = NULL; + int result = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct ib_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + /* + * wait + */ + down(&file->mutex); + + while (list_empty(&file->events)) { + + if (file->filp->f_flags & O_NONBLOCK) { + result = -EAGAIN; + break; + } + + if (signal_pending(current)) { + result = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + + up(&file->mutex); + schedule(); + down(&file->mutex); + + finish_wait(&file->poll_wait, &wait); + } + + if (result) + goto done; + + uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); + + if (uevent->resp.id != IB_UCM_CM_ID_INVALID) + goto user; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } + + uevent->resp.id = ctx->id; + +user: + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + result = -EFAULT; + goto done; + } + + if (uevent->data) { + + if (cmd.data_len < uevent->data_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.data, + uevent->data, cmd.data_len)) { + result = -EFAULT; + goto done; + } + } + + if (uevent->info) { + + if (cmd.info_len < uevent->info_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.info, + uevent->info, cmd.info_len)) { + result = -EFAULT; + goto done; + } + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + if (uevent->data) + kfree(uevent->data); + if (uevent->info) + kfree(uevent->info); + kfree(uevent); +done: + up(&file->mutex); + return result; +} + + +static ssize_t ib_ucm_create_id(struct ib_ucm_file *file, + const char __user 
*inbuf, + int in_len, int out_len) +{ + struct ib_ucm_create_id cmd; + struct ib_ucm_create_id_resp resp; + struct ib_ucm_context *ctx; + int result; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) + return -ENOMEM; + + ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, + (void *)(unsigned long)ctx->id); + if (!ctx->cm_id) { + result = -ENOMEM; + goto err_cm; + } + + resp.id = ctx->id; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + result = -EFAULT; + goto err_ret; + } + + return 0; +err_ret: + (void)ib_destroy_cm_id(ctx->cm_id); +err_cm: + ib_ucm_ctx_put(ctx); /* user reference */ + + return result; +} + +static ssize_t ib_ucm_destroy_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_destroy_id cmd; + struct ib_ucm_context *ctx; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + ib_ucm_ctx_put(ctx); /* user reference */ + ib_ucm_ctx_put(ctx); /* func reference */ + + return 0; +} + +static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_attr_id_resp resp; + struct ib_ucm_attr_id cmd; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) { + result = -EINVAL; + goto done; + } + + resp.service_id = ctx->cm_id->service_id; + resp.service_mask = ctx->cm_id->service_mask; + resp.state = ctx->cm_id->state; + resp.lap_state = ctx->cm_id->lap_state; + resp.local_id = ctx->cm_id->local_id; + resp.remote_id = ctx->cm_id->remote_id; + + if (copy_to_user((void __user *)(unsigned 
long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + +done: + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_listen(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_listen cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_listen(ctx->cm_id, cmd.service_id, + cmd.service_mask); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_establish(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_establish cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_establish(ctx->cm_id); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static int ib_ucm_alloc_data(void **dest, u64 src, u32 len) +{ + void *data; + + *dest = NULL; + + if (!len) + return 0; + + data = kmalloc(len, GFP_KERNEL); + if (!data) + return -ENOMEM; + + if (copy_from_user(data, (void __user *)(unsigned long)src, len)) { + kfree(data); + return -EFAULT; + } + + *dest = data; + return 0; +} + +static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src) +{ + struct ib_ucm_path_rec ucm_path; + struct ib_sa_path_rec *sa_path; + + *path = NULL; + + if (!src) + return 0; + + sa_path = kmalloc(sizeof(*sa_path), GFP_KERNEL); + if (!sa_path) + return -ENOMEM; + + if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src, + sizeof(ucm_path))) { + + kfree(sa_path); + return -EFAULT; 
+ } + + memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof(union ib_gid)); + memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof(union ib_gid)); + + sa_path->dlid = ucm_path.dlid; + sa_path->slid = ucm_path.slid; + sa_path->raw_traffic = ucm_path.raw_traffic; + sa_path->flow_label = ucm_path.flow_label; + sa_path->hop_limit = ucm_path.hop_limit; + sa_path->traffic_class = ucm_path.traffic_class; + sa_path->reversible = ucm_path.reversible; + sa_path->numb_path = ucm_path.numb_path; + sa_path->pkey = ucm_path.pkey; + sa_path->sl = ucm_path.sl; + sa_path->mtu_selector = ucm_path.mtu_selector; + sa_path->mtu = ucm_path.mtu; + sa_path->rate_selector = ucm_path.rate_selector; + sa_path->rate = ucm_path.rate; + sa_path->packet_life_time = ucm_path.packet_life_time; + sa_path->preference = ucm_path.preference; + + sa_path->packet_life_time_selector = + ucm_path.packet_life_time_selector; + + *path = sa_path; + return 0; +} + +static ssize_t ib_ucm_send_req(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_req_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_req cmd; + int result; + + param.private_data = NULL; + param.primary_path = NULL; + param.alternate_path = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&param.primary_path, cmd.primary_path); + if (result) + goto done; + + result = ib_ucm_path_get(&param.alternate_path, cmd.alternate_path); + if (result) + goto done; + + param.private_data_len = cmd.len; + param.service_id = cmd.sid; + param.qp_num = cmd.qpn; + param.qp_type = cmd.qp_type; + param.starting_psn = cmd.psn; + param.peer_to_peer = cmd.peer_to_peer; + param.responder_resources = cmd.responder_resources; + param.initiator_depth = cmd.initiator_depth; + param.remote_cm_response_timeout = cmd.remote_cm_response_timeout; + param.flow_control = cmd.flow_control; +
param.local_cm_response_timeout = cmd.local_cm_response_timeout; + param.retry_count = cmd.retry_count; + param.rnr_retry_count = cmd.rnr_retry_count; + param.max_cm_retries = cmd.max_cm_retries; + param.srq = cmd.srq; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_req(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.primary_path) + kfree(param.primary_path); + if (param.alternate_path) + kfree(param.alternate_path); + + return result; +} + +static ssize_t ib_ucm_send_rep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_rep_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_rep cmd; + int result; + + param.private_data = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + return result; + + param.qp_num = cmd.qpn; + param.starting_psn = cmd.psn; + param.private_data_len = cmd.len; + param.responder_resources = cmd.responder_resources; + param.initiator_depth = cmd.initiator_depth; + param.target_ack_delay = cmd.target_ack_delay; + param.failover_accepted = cmd.failover_accepted; + param.flow_control = cmd.flow_control; + param.rnr_retry_count = cmd.rnr_retry_count; + param.srq = cmd.srq; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_rep(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + + return result; +} + +static ssize_t ib_ucm_send_private_data(struct ib_ucm_file *file, + const char __user *inbuf, int in_len, +
int (*func)(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len)) +{ + struct ib_ucm_private_data cmd; + struct ib_ucm_context *ctx; + void *private_data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&private_data, cmd.data, cmd.len); + if (result) + return result; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = func(ctx->cm_id, private_data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (private_data) + kfree(private_data); + + return result; +} + +static ssize_t ib_ucm_send_rtu(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_rtu); +} + +static ssize_t ib_ucm_send_dreq(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_dreq); +} + +static ssize_t ib_ucm_send_drep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_drep); +} + +static ssize_t ib_ucm_send_info(struct ib_ucm_file *file, + const char __user *inbuf, int in_len, + int (*func)(struct ib_cm_id *cm_id, + int status, + void *info, + u8 info_len, + void *data, + u8 data_len)) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_info cmd; + void *data = NULL; + void *info = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.data_len); + if (result) + goto done; + + result = ib_ucm_alloc_data(&info, cmd.info, cmd.info_len); + if (result) + goto done; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if 
(ctx->file != file) + result = -EINVAL; + else + result = func(ctx->cm_id, cmd.status, + info, cmd.info_len, + data, cmd.data_len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + if (info) + kfree(info); + + return result; +} + +static ssize_t ib_ucm_send_rej(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_rej); +} + +static ssize_t ib_ucm_send_apr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_apr); +} + +static ssize_t ib_ucm_send_mra(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_mra cmd; + void *data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.len); + if (result) + return result; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_mra(ctx->cm_id, cmd.timeout, + data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + + return result; +} + +static ssize_t ib_ucm_send_lap(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_sa_path_rec *path = NULL; + struct ib_ucm_lap cmd; + void *data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&path, cmd.path); + if (result) + goto done; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if 
(ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_lap(ctx->cm_id, path, data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + if (path) + kfree(path); + + return result; +} + +static ssize_t ib_ucm_send_sidr_req(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_sidr_req_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_sidr_req cmd; + int result; + + param.private_data = NULL; + param.path = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&param.path, cmd.path); + if (result) + goto done; + + param.private_data_len = cmd.len; + param.service_id = cmd.sid; + param.timeout_ms = cmd.timeout; + param.max_cm_retries = cmd.max_cm_retries; + param.pkey = cmd.pkey; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_sidr_req(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.path) + kfree(param.path); + + return result; +} + +static ssize_t ib_ucm_send_sidr_rep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_sidr_rep_param param; + struct ib_ucm_sidr_rep cmd; + struct ib_ucm_context *ctx; + int result; + + param.info = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, + cmd.data, cmd.data_len); + if (result) + goto done; + + result = ib_ucm_alloc_data(&param.info, cmd.info, cmd.info_len); + if (result) + goto done; + + param.qp_num = cmd.qpn; + param.qkey = cmd.qkey; + param.status = cmd.status; + param.info_length =
cmd.info_len; + param.private_data_len = cmd.data_len; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_sidr_rep(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.info) + kfree(param.info); + + return result; +} + +static ssize_t ib_ucm_qp_attr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return 0; +} + +static ssize_t (*ucm_cmd_table[])(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [IB_USER_CM_CMD_CREATE_ID] = ib_ucm_create_id, + [IB_USER_CM_CMD_DESTORY_ID] = ib_ucm_destroy_id, + [IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id, + [IB_USER_CM_CMD_LISTEN] = ib_ucm_listen, + [IB_USER_CM_CMD_ESTABLISH] = ib_ucm_establish, + [IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req, + [IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep, + [IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu, + [IB_USER_CM_CMD_SEND_DREQ] = ib_ucm_send_dreq, + [IB_USER_CM_CMD_SEND_DREP] = ib_ucm_send_drep, + [IB_USER_CM_CMD_SEND_REJ] = ib_ucm_send_rej, + [IB_USER_CM_CMD_SEND_MRA] = ib_ucm_send_mra, + [IB_USER_CM_CMD_SEND_LAP] = ib_ucm_send_lap, + [IB_USER_CM_CMD_SEND_APR] = ib_ucm_send_apr, + [IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req, + [IB_USER_CM_CMD_SEND_SIDR_REP] = ib_ucm_send_sidr_rep, + [IB_USER_CM_CMD_QP_ATTR] = ib_ucm_qp_attr, + [IB_USER_CM_CMD_EVENT] = ib_ucm_qp_event, +}; + +static ssize_t ib_ucm_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_cmd_hdr hdr; + ssize_t result; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + printk(KERN_ERR "UCM: Write. 
cmd <%d> in <%d> out <%d> len <%d>\n", + hdr.cmd, hdr.in, hdr.out, len); + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + result = ucm_cmd_table[hdr.cmd](file, buf + sizeof(hdr), + hdr.in, hdr.out); + if (!result) + result = len; + + return result; +} + +static unsigned int ib_ucm_poll(struct file *filp, + struct poll_table_struct *wait) +{ + struct ib_ucm_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_ucm_open(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file; + + file = kmalloc(sizeof(*file), GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + + init_MUTEX(&file->mutex); + + filp->private_data = file; + file->filp = filp; + + printk(KERN_ERR "UCM: Created struct\n"); + + return 0; +} + +static int ib_ucm_close(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_context *ctx; + + down(&file->mutex); + + while (!list_empty(&file->ctxs)) { + + ctx = list_entry(file->ctxs.next, + struct ib_ucm_context, file_list); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* user reference */ + down(&file->mutex); + } + + up(&file->mutex); + + kfree(file); + + printk(KERN_ERR "UCM: Deleted struct\n"); + return 0; +} + +static struct file_operations ib_ucm_fops = { + .owner = THIS_MODULE, + .open = ib_ucm_open, + .release = ib_ucm_close, + .write = ib_ucm_write, + .poll = ib_ucm_poll, +}; + + +static struct class_simple *ib_ucm_class; +static struct cdev ib_ucm_cdev; + +static int __init ib_ucm_init(void) +{ + int result; + + result = register_chrdev_region(IB_UCM_DEV, 1, "infiniband_cm"); + if (result) { + printk(KERN_ERR "UCM: Error <%d> registering dev\n", 
result); + goto err_chr; + } + + cdev_init(&ib_ucm_cdev, &ib_ucm_fops); + + result = cdev_add(&ib_ucm_cdev, IB_UCM_DEV, 1); + if (result) { + printk(KERN_ERR "UCM: Error <%d> adding cdev\n", result); + goto err_cdev; + } + + ib_ucm_class = class_simple_create(THIS_MODULE, "ucm"); + if (IS_ERR(ib_ucm_class)) { + result = PTR_ERR(ib_ucm_class); + printk(KERN_ERR "UCM: Error <%d> creating class\n", result); + goto err_class; + } + + class_simple_device_add(ib_ucm_class, + IB_UCM_DEV, + NULL, + "ucm"); + + devfs_mk_cdev(IB_UCM_DEV, + S_IFCHR|S_IRUGO|S_IWUGO, + "infiniband/ucm"); + + idr_init(&ctx_id_table); + init_MUTEX(&ctx_id_mutex); + + return 0; +err_class: + cdev_del(&ib_ucm_cdev); +err_cdev: + unregister_chrdev_region(IB_UCM_DEV, 1); +err_chr: + return result; +} + +static void __exit ib_ucm_cleanup(void) +{ + devfs_remove("infiniband/ucm"); + class_simple_device_remove(IB_UCM_DEV); + class_simple_destroy(ib_ucm_class); + cdev_del(&ib_ucm_cdev); + unregister_chrdev_region(IB_UCM_DEV, 1); +} + +module_init(ib_ucm_init); +module_exit(ib_ucm_cleanup); Index: infiniband/core/ucm.h =================================================================== --- infiniband/core/ucm.h (revision 0) +++ infiniband/core/ucm.h (revision 0) @@ -0,0 +1,84 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef UCM_H +#define UCM_H + +#include +#include +#include +#include + +#include +#include + +#define IB_UCM_CM_ID_INVALID 0xffffffff + +struct ib_ucm_file { + struct semaphore mutex; + struct file *filp; + /* + * list of pending events + */ + struct list_head ctxs; /* list of active connections */ + struct list_head events; /* list of pending events */ + wait_queue_head_t poll_wait; +}; + +struct ib_ucm_context { + int id; + int ref; + int error; + + struct ib_ucm_file *file; + struct ib_cm_id *cm_id; + struct semaphore mutex; + + struct list_head events; /* list of pending events. 
*/ + struct list_head file_list; /* member in file ctx list */ +}; + +struct ib_ucm_event { + struct ib_ucm_context *ctx; + struct list_head file_list; /* member in file event list */ + struct list_head ctx_list; /* member in ctx event list */ + + struct ib_ucm_event_resp resp; + void *data; + void *info; + int data_len; + int info_len; +}; + +#endif /* UCM_H */ From libor at topspin.com Fri Mar 11 15:26:28 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:26:28 -0800 Subject: [openib-general] Re: [PATCH] [TRIVIAL] SDP: sdp_actv.c remove redundant initialization In-Reply-To: <1110379398.4645.46.camel@localhost.localdomain>; from halr@voltaire.com on Wed, Mar 09, 2005 at 09:43:18AM -0500 References: <1110379398.4645.46.camel@localhost.localdomain> Message-ID: <20050311152628.D31689@topspin.com> On Wed, Mar 09, 2005 at 09:43:18AM -0500, Hal Rosenstock wrote: > SDP: sdp_actv.c remove redundant initialization > > qp_attr->min_rnr_timer is already initialized to 0 by > cm_init_qp_rtr_attr in cm.c > > Is this really intended to be IB_RNR_TIMER_122_88 instead ? No, the RNR timer can be set to 0. SDP should never need RNR since the protocol ensures that buffers are posted for receive before the remote connection peer sends any data. Applied and commited along with the patch for sdp_pass.c Thanks. -Libor From libor at topspin.com Fri Mar 11 15:43:16 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:43:16 -0800 Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050311073108.GA20989@mellanox.co.il>; from mst@mellanox.co.il on Fri, Mar 11, 2005 at 09:31:08AM +0200 References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> Message-ID: <20050311154316.E31689@topspin.com> On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote: > > Sdp also has a couple of uses. > Maybe we can use the atomic branch in all cases here, as well? > Libor? 
Yes, the case in sdp_iocb.c can probably always take the atomic path. The kmap/kunmap cases really only care whether we're in an interrupt, so switching to in_interrupt() should be sufficient. -Libor From roland at topspin.com Fri Mar 11 15:45:31 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 11 Mar 2005 15:45:31 -0800 Subject: [openib-general] [RFC] Userspace CM access. In-Reply-To: <20050311152213.C31689@topspin.com> (Libor Michalek's message of "Fri, 11 Mar 2005 15:22:13 -0800") References: <20050311150946.A31689@topspin.com> <20050311152213.C31689@topspin.com> Message-ID: <524qfh7nms.fsf@topspin.com> I suggest tabifying the file -- there seem to be some whitespace problems like: + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) (spaces on one line, tabs the next). More substantive comments later... - R. From hozer at hozed.org Fri Mar 11 17:15:27 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 11 Mar 2005 19:15:27 -0600 Subject: [openib-general] kernel 2.6.11 and userland packages? Message-ID: <20050312011527.GC9768@kalmia.hozed.org> I have in my office a shiny new kernel.org 2.6.11 64 bit kernel running on my Mac G5, with the drivers/infiniband modules loaded. What do I need to do to verify this all works? Also, I'd really like to make debian packages of the userland utilities and libraries, and get a debian/ subdirectory into the subversion release, so the packages can be rebuilt easily. Where should I start on this? -- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at hozed.org Somone asked my why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." 
-- Charles Shultz From hozer at hozed.org Fri Mar 11 21:55:09 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 11 Mar 2005 23:55:09 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <20050309025647.GN5502@esmail.cup.hp.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <20050309025647.GN5502@esmail.cup.hp.com> Message-ID: <20050312055509.GH9768@kalmia.hozed.org> On Tue, Mar 08, 2005 at 06:56:47PM -0800, Grant Grundler wrote: > On Tue, Mar 08, 2005 at 03:31:48PM -0800, Matt Leininger wrote: > > You can grab the openib source code from the subversion repository. > > See http://www.openib.org/tools.html. If you want everything run 'svn > > co https://openib.org/svn' > > Matt, > probably best to just add a short blurb to tools.html > that includes an example using gen2 branch. That's what > we want people to focus on I think. Having just waded into this stuff, I'd really like a "just_build_it_all.sh" script. Well, actually, what I'd really like is to do: svn co https://openib.org/some/path cd some/path fakeroot dpkg-buildpackage and get me some Debian packages ;) FYI, I'm hoping that at least the PPC Debian 2.6.11 kernel packages will have IB modules enabled in the .config From mst at mellanox.co.il Mon Mar 14 06:46:50 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Mar 2005 16:46:50 +0200 Subject: [openib-general] [PATCH] alignment check in reg_phys_mr Message-ID: <20050314144650.GF16749@mellanox.co.il> Apparently reg_phys_mr in mthca requires that the start address is page aligned. Seems like a bug to me. Roland? Signed-off-by: Michael S.
Tsirkin Index: drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- drivers/infiniband/hw/mthca/mthca_provider.c (revision 1983) +++ drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -494,7 +494,7 @@ static struct ib_mr *mthca_reg_phys_mr(s mask = 0; total_size = 0; for (i = 0; i < num_phys_buf; ++i) { - if (buffer_list[i].addr & ~PAGE_MASK) + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) return ERR_PTR(-EINVAL); if (i != 0 && i != num_phys_buf - 1 && (buffer_list[i].size & ~PAGE_MASK)) -- MST - Michael S. Tsirkin From halr at voltaire.com Mon Mar 14 08:07:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Mar 2005 11:07:25 -0500 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <20050312203709.34B3522834D@openib.ca.sandia.gov> References: <20050312203709.34B3522834D@openib.ca.sandia.gov> Message-ID: <1110816445.4645.31.camel@localhost.localdomain> On Sat, 2005-03-12 at 15:37, roland at openib.org wrote: > + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; > + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; These are getting set to 1. That's not what I was expecting. Thanks. -- Hal From gshipman at cs.unm.edu Mon Mar 14 09:21:56 2005 From: gshipman at cs.unm.edu (gshipman) Date: Mon, 14 Mar 2005 10:21:56 -0700 Subject: [openib-general] vstat error on bproc slave node (VAPI_EGEN) Message-ID: <6b97df984990ccc634d3ab93673a52fd@cs.unm.edu> I am relatively new to openib so here goes: I am attempting to configure our small cluster to use bproc and openib. Note I am using gen1 on kernel 2.6.6 patched with the clustermatic stuff, (should I be using gen2, is it stable for general use?). I have successfully gotten things going on the head node including opensm. 
I have successfully gotten the slave nodes to run the patched kernel, load the appropriate modules as well as the various user level libraries but I am having an issue on the slave nodes: If I run: $bpsh 13 /usr/mellanox/bin/vstat 1 HCA found: hca_id=InfiniHost0 Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EGEN) On the head node I get: $/usr/mellanox/bin/vstat 1 HCA found: hca_id=InfiniHost0 vendor_id=0x02C9 vendor_part_id=0x5A44 hw_ver=0xA1 fw_ver=0x300020000 num_phys_ports=2 port=1 port_state=PORT_DOWN sm_lid=0x0000 port_lid=0x0353 port_lmc=0x00 max_mtu=2048 port=2 port_state=PORT_ACTIVE sm_lid=0x0354 port_lid=0x0354 port_lmc=0x00 max_mtu=2048 I can run ifconfig on the slave I see ib0 properly: $bpsh 13 ifconfig ib0 ib0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Thanks, Galen From roland at topspin.com Mon Mar 14 09:28:38 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 09:28:38 -0800 Subject: [openib-general] [PATCH] Make SDP compile with gcc-2.95 Message-ID: <52br9m2l2x.fsf@topspin.com> This trivial patch seems to be required to get SDP to compile with gcc 2.95. It seems to be working around a bug with handling empty "arg..." parameters to macros (without this change, gcc 2.95 eats x->state in addition to the comma following it when arg is empty). - R. 
Index: infiniband/ulp/sdp/sdp_proto.h =================================================================== --- infiniband/ulp/sdp/sdp_proto.h (revision 1977) +++ infiniband/ulp/sdp/sdp_proto.h (working copy) @@ -482,7 +482,7 @@ extern int sdp_debug_level; if (x) { \ sdp_dbg_out(level, type, \ "<%d> <%04x:%04x> " format, \ - x->hashent, x->istate, x->state, \ + x->hashent, x->istate, x->state , \ ## arg); \ } \ else { \ From rminnich at lanl.gov Mon Mar 14 09:37:55 2005 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Mon, 14 Mar 2005 10:37:55 -0700 (MST) Subject: [openib-general] vstat error on bproc slave node (VAPI_EGEN) In-Reply-To: <6b97df984990ccc634d3ab93673a52fd@cs.unm.edu> References: <6b97df984990ccc634d3ab93673a52fd@cs.unm.edu> Message-ID: On Mon, 14 Mar 2005, gshipman wrote: > I am attempting to configure our small cluster to use bproc and openib. > Note I am using gen1 on kernel 2.6.6 patched with the clustermatic > stuff, (should I be using gen2, is it stable for general use?). use gen2. I have tested it and it is ok. > > I have successfully gotten things going on the head node including opensm. I > have successfully gotten the slave nodes to run the patched kernel, load the > appropriate modules as well as the various user level libraries but I am > having an issue on the slave nodes: > > If I run: > $bpsh 13 /usr/mellanox/bin/vstat > 1 HCA found: > hca_id=InfiniHost0 > Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EGEN) arg. I used to have this a lot. It's a mellanox issue, and it's a pain to work around. Can you just cut to gen2 and stop using gen1? I would really recommend for future use only using gen2 and not using any of the mellanox stuff. I realize user level is not there yet but I think it is worth just waiting for. 
ron From roland at topspin.com Mon Mar 14 10:05:48 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:05:48 -0800 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <1110816445.4645.31.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Mar 2005 11:07:25 -0500") References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> Message-ID: <52oedm14sj.fsf@topspin.com> > > + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; > > + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; > These are getting set to 1. That's not what I was expecting. i.e. rdb_shift == 0. Hmm... OK, should be fixed now. - R. From mst at mellanox.co.il Mon Mar 14 10:13:45 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Mar 2005 20:13:45 +0200 Subject: [openib-general] Re: fmr support in mthca In-Reply-To: <523bv286ig.fsf@topspin.com> References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> <20050311131446.GC20989@mellanox.co.il> <523bv286ig.fsf@topspin.com> Message-ID: <20050314181345.GB17668@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: fmr support in mthca > > Michael> Roland, would you like me to implement FMRs in mthca? It > Michael> is needed by SDP for zero copy support. > > Yes, that would be great. > > BTW, for mem-free mode I put the MPT and MTT in lowmem to make FMRs > simpler to use. > > - R. > OK, I have done the implementation, will test and post tomorrow. -- MST - Michael S. 
Tsirkin From halr at voltaire.com Mon Mar 14 10:09:11 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Mar 2005 13:09:11 -0500 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <52oedm14sj.fsf@topspin.com> References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> Message-ID: <1110823751.4645.5.camel@localhost.localdomain> On Mon, 2005-03-14 at 13:05, Roland Dreier wrote: > > > + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; > > > + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; > > > These are getting set to 1. That's not what I was expecting. > > i.e. rdb_shift == 0. Hmm... > > OK, should be fixed now. This is now getting set to 4 (rdb_shift = 2). Still not what I was expecting :-( -- Hal From roland at topspin.com Mon Mar 14 10:18:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:18:18 -0800 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <1110823751.4645.5.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Mar 2005 13:09:11 -0500") References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> <1110823751.4645.5.camel@localhost.localdomain> Message-ID: <527jka147p.fsf@topspin.com> Hal> This is now getting set to 4 (rdb_shift = 2). Still not what Hal> I was expecting :-( What were you expecting? - R. 
From halr at voltaire.com Mon Mar 14 10:17:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Mar 2005 13:17:21 -0500 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <527jka147p.fsf@topspin.com> References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> <1110823751.4645.5.camel@localhost.localdomain> <527jka147p.fsf@topspin.com> Message-ID: <1110824241.4645.9.camel@localhost.localdomain> On Mon, 2005-03-14 at 13:18, Roland Dreier wrote: > Hal> This is now getting set to 4 (rdb_shift = 2). Still not what > Hal> I was expecting :-( > > What were you expecting? I thought this would be a larger number, around 64K. That's what I think gen1 sees. Is 4 correct (for gen2) ? -- Hal From roland at topspin.com Mon Mar 14 10:51:04 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:51:04 -0800 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <1110824241.4645.9.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Mar 2005 13:17:21 -0500") References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> <1110823751.4645.5.camel@localhost.localdomain> <527jka147p.fsf@topspin.com> <1110824241.4645.9.camel@localhost.localdomain> Message-ID: <52psy2ysbr.fsf@topspin.com> Hal> I thought this would be a larger number, around 64K. That's Hal> what I think gen1 sees. Is 4 correct (for gen2) ? The initiator number may be slightly bogus but the target number is correct. Each RDMA request takes 32 bytes of context memory at the target, so I don't see how a driver could support 64K outstanding RDMAs per QP (that would be 64K * 32 bytes * ~64K possible QPs = 128GB of context memory). - R. 
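The arithmetic in the exchange above can be checked with a short sketch. It is only an illustration: the function names are hypothetical and are not taken from mthca, the 32 bytes per outstanding RDMA request and the roughly 64K QPs are the figures quoted in the message, and `1 << rdb_shift` is the expression from the quoted hunk.

```c
#include <stdint.h>

/*
 * Per-QP limit as computed in the quoted hunk: 1 << rdb_shift.
 * (Hypothetical helper name; rdb_shift is the value discussed in the thread.)
 */
static unsigned int max_rd_atom(unsigned int rdb_shift)
{
    return 1u << rdb_shift;
}

/*
 * Total context memory needed at the target if every QP may have
 * rd_atom_per_qp outstanding RDMA requests, each costing
 * bytes_per_request bytes of context.
 */
static uint64_t rdb_context_bytes(uint64_t rd_atom_per_qp,
                                  uint64_t bytes_per_request,
                                  uint64_t num_qps)
{
    return rd_atom_per_qp * bytes_per_request * num_qps;
}
```

With rdb_shift values of 0 and 2, max_rd_atom() gives the 1 and 4 reported in the thread. rdb_context_bytes(65536, 32, 65536) is 2^37 bytes, i.e. the 128GB quoted above, while rdb_context_bytes(4, 32, 65536) comes to 8MB.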
From roland at topspin.com Mon Mar 14 10:51:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:51:18 -0800 Subject: [openib-general] Re: fmr support in mthca In-Reply-To: <20050314181345.GB17668@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Mar 2005 20:13:45 +0200") References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> <20050311131446.GC20989@mellanox.co.il> <523bv286ig.fsf@topspin.com> <20050314181345.GB17668@mellanox.co.il> Message-ID: <52ll8qysbd.fsf@topspin.com> Michael> OK, I have done the implementation, will test and post tomorrow. Excellent, I'm looking forward to seeing it. - R. From roland at topspin.com Mon Mar 14 10:53:10 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:53:10 -0800 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <20050310123129.GA12542@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 10 Mar 2005 14:31:29 +0200") References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> Message-ID: <52hdjeys89.fsf@topspin.com> I started looking over this code. As far as I can see, neither tx_depth nor rx_depth is used for anything. Is this correct? Should we just get rid of the options? Also would it make sense to change the RDMA operation to be unsignaled and just poll the destination buffer (ignore completions)? I realize this is a Mellanox extension to the spec but it might be more interesting than yet another variation on the pingpong code. - R. From tduffy at sun.com Mon Mar 14 10:57:13 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 14 Mar 2005 10:57:13 -0800 Subject: [openib-general] .org pavilion spot in LW 2005 in SF Message-ID: <1110826633.21708.8.camel@duffman> Duncan, Hello, I am contacting you as a representative from the OpenIB.org alliance. 
We are a non-profit organization that is dedicated to providing an open-source, multi-vendor, best-of-breed Infiniband stack for the Linux kernel as well as all the related userland libraries and utilities. Our website is http://www.openib.org. All of our projects are available under the GPL as well as a BSD license. We would like a slot in the .org pavilion for LinuxWorld 2005 in San Francisco. The booth will have demos of InfiniBand in action using the recently accepted code in the 2.6.11 kernel running on multiple vendors hardware. Please "reply all" as I have CC'ed the developer list for OpenIB. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Mon Mar 14 11:10:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Mar 2005 21:10:11 +0200 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <52hdjeys89.fsf@topspin.com> References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> <52hdjeys89.fsf@topspin.com> Message-ID: <20050314191011.GD17668@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] uverbs rdma example (updated) > > I started looking over this code. As far as I can see, neither > tx_depth nor rx_depth is used for anything. Is this correct? Should > we just get rid of the options? Hmm. rx_depth is unused. tx_depth is used. > Also would it make sense to change the RDMA operation to be unsignaled > and just poll the destination buffer (ignore completions)? Hmm. Thats what I do for receieve - polling on data. You cant assume the hardware will not read the buffer until you get a send completion, so you wont be able to re-use the send buffer. Since polling cq is done after post, it does not affect the latency in any way. 
> I realize > this is a Mellanox extension to the spec but it might be more > interesting than yet another variation on the pingpong code. > > - R. > What do you refer to as extension? -- MST - Michael S. Tsirkin From roland at topspin.com Mon Mar 14 11:57:22 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 11:57:22 -0800 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <20050314191011.GD17668@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Mar 2005 21:10:11 +0200") References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> <52hdjeys89.fsf@topspin.com> <20050314191011.GD17668@mellanox.co.il> Message-ID: <52u0nexaot.fsf@topspin.com> Michael> Hmm. rx_depth is unused. tx_depth is used. Where? If I search through the whole patch for "tx_depth," the only place I see it do anything at all is in + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); but I don't see how more than one send can be outstanding. Michael> Hmm. Thats what I do for receieve - polling on data. Michael> You cant assume the hardware will not read the buffer Michael> until you get a send completion, so you wont be able to Michael> re-use the send buffer. Since polling cq is done after Michael> post, it does not affect the latency in any way. That makes sense. Also I forgot that without a completion we can never clean up the WQE buffer. - R. From hozer at hozed.org Mon Mar 14 15:01:18 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 14 Mar 2005 17:01:18 -0600 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <20050314230118.GP9768@kalmia.hozed.org> The current InfiniBand model of using 'mlock()' to maintain a constant virtual to physical mapping for registered memory pages is not going to work with NUMA page migration and memory hotplug. 
I want to get some discussion started on this list and, once we have an idea of what's feasible from the InfiniBand side, bring up the discussion on linux-kernel and get the memory hotplug and NUMA page migration people involved as well.

I think the following list covers the major points. Are there any big "gotchas" involved?

* Add a "registered" flag to linux/mm.h (VM_REGISTERED 0x01000000)

* Need to define a 'registered memory' API. Maybe using 'madvise()'?

* Kernel needs to be able to unpin registered memory and shoot down cached mappings in network cards (treat IB/iWARP cards like a TLB)

* Requires the IB/iWARP card to dispatch an interrupt on a mapping 'miss'

* This model allows applications to register more memory than physically exists, and the kernel manages what is actually pinned.

* Requires adding hooks in MM code to dispatch driver mapping shootdowns. (A per-VM area list of adapters to be notified for the mapping?)

I know that having the card dispatch an interrupt on an incoming packet that's not mapped is outside the spec. The alternative is that if the kernel wants to move some registered memory around, it's got to have some way to either kill the application, or tear down and re-establish all the QPs. I suppose a "SIG_I_KILLED_YOUR_MAPPINGS" type signal, telling the application (or library) that it needs to re-establish all its pinned memory, might work as an alternative.
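For readers unfamiliar with the pinning being discussed, here is a minimal userspace sketch, assuming a POSIX/Linux system. It only shows what locking a buffer with mlock() and querying the per-process lock limit look like; the helper names are hypothetical, and it does not show how the OpenIB stack itself registers memory with an HCA.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Fetch the soft RLIMIT_MEMLOCK value in bytes; returns 0 on success, -1 on error. */
static int memlock_limit(rlim_t *limit_out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *limit_out = rl.rlim_cur;
    return 0;
}

/*
 * Allocate a page-aligned buffer of len bytes and try to lock it in RAM.
 * Returns 0 on success (buffer stored in *buf_out) or an errno value.
 */
static int pin_buffer(size_t len, void **buf_out)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    int rc;

    if (page <= 0)
        return EINVAL;

    rc = posix_memalign(&buf, (size_t)page, len);
    if (rc != 0)
        return rc;

    if (mlock(buf, len) != 0) {
        rc = errno;     /* typically ENOMEM or EPERM when over the limit */
        free(buf);
        return rc;
    }

    *buf_out = buf;
    return 0;
}

/* Unlock and release a buffer obtained from pin_buffer(). */
static void unpin_buffer(void *buf, size_t len)
{
    munlock(buf, len);
    free(buf);
}
```

Once pin_buffer() succeeds, the kernel keeps those pages resident until unpin_buffer() is called, and requests beyond the RLIMIT_MEMLOCK value fail for unprivileged processes. That fixed, application-held lock is the constraint the proposal above wants to relax.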
From caitlinb at siliquent.com Mon Mar 14 15:29:06 2005 From: caitlinb at siliquent.com (Caitlin Bestler) Date: Mon, 14 Mar 2005 15:29:06 -0800 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <8508251A6FC08A489844A94261D3693A039002@fiona.siliquent.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Troy > Benjegerdes > Sent: Monday, March 14, 2005 3:01 PM > To: openib-general at openib.org > Subject: [openib-general] Getting rid of pinned memory requirement > > The current InfiniBand model of using 'mlock()' to maintain a > constant virtual to physical mapping for registered memory > pages is not going to work with NUMA page migration and > memory hotplug. > > I want to get some discussion started on this list, and once > we have an idea what's feasable from the infiniband side, to > bring up the discussion on linux-kernel, and get the memory > hotplug and numa page migration people involved as well. > > I think the following list covers the major points. Are there > any big "gotcha's" involved? > > * Add "registered" flag to linux/mm.h (VM_REGISTERED 0x01000000) > > * Need to define a 'registered memory' api. Maybe using 'madvise()' ? > > * Kernel needs to be able to unpin registered memory and > shoot down cached > mappings in network cards (treat IB/Iwarp cards like a TLB) > > * Requires IB/Iwarp card to dispatch an interrupt on a mapping 'miss' > The point of requiring that the memory be pinned is so that the IB/iWARP card does not have to deal with the kernel on a per-placement basis. That includes having to double-check any host memory resources to see if there is anything to 'miss' in the mapping. Once a memory region is registered the HCA/RNIC is entitled to assume that the mapping from LKey/Address (STag/TO) to physical memory is not subject to change. 
Enhancement protocols have been discussed in both DAPL and RNIC-PI to allow kernels to rearrange memory, but they involve the host explicitly telling the HCA/RNIC to suspend access to a memory region *and* when possible taking action to quiesce the connections using the memory region. > * This model allows applications to register more memory than > physically exists, and the kernel manages what is actually pinned. > Fundamental to any definition of RDMA is that the application controls the availability of target memory -- not the kernel. That is why traditional buffer flow controls do not apply. > * Requires adding hooks in MM code to dispatch driver mapping > shootdowns. (A > per-VM area list of adapters to be notified for the mapping?) > > > I know that having the card dispatch an interrupt on an > incoming packet that's not mapped is outside the spec. The > alternative is that if the kernel wants to move some memory > around that's registered, it's got to have some way to either > kill the application, or tear down and re-establish all the > QP's. I suppose an alternative would be a > "SIG_I_KILLED_YOUR_MAPPINGS" type signal to tell the > application (or library) that it needs to re-establish all > it's pinned memory might work. > Only if you are re-arranging memory for a bunch of connections that were taking a nice nap. If you did this for active connections they could be dead before you could reregister the memory. And even if you could reregister it, how do you redistribute the RKeys?
From hozer at hozed.org Mon Mar 14 15:56:05 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 14 Mar 2005 17:56:05 -0600 Subject: [openib-general] Getting rid of pinned memory requirement In-Reply-To: <8508251A6FC08A489844A94261D3693A039002@fiona.siliquent.com> References: <8508251A6FC08A489844A94261D3693A039002@fiona.siliquent.com> Message-ID: <20050314235605.GS9768@kalmia.hozed.org> On Mon, Mar 14, 2005 at 03:29:06PM -0800, Caitlin Bestler wrote: > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Troy > > Benjegerdes > > Sent: Monday, March 14, 2005 3:01 PM > > To: openib-general at openib.org > > Subject: [openib-general] Getting rid of pinned memory requirement > > > > The current InfiniBand model of using 'mlock()' to maintain a > > constant virtual to physical mapping for registered memory > > pages is not going to work with NUMA page migration and > > memory hotplug. > > > > I want to get some discussion started on this list, and once > > we have an idea what's feasable from the infiniband side, to > > bring up the discussion on linux-kernel, and get the memory > > hotplug and numa page migration people involved as well. > > > > I think the following list covers the major points. Are there > > any big "gotcha's" involved? > > > > * Add "registered" flag to linux/mm.h (VM_REGISTERED 0x01000000) > > > > * Need to define a 'registered memory' api. Maybe using 'madvise()' ? > > > > * Kernel needs to be able to unpin registered memory and > > shoot down cached > > mappings in network cards (treat IB/Iwarp cards like a TLB) > > > > * Requires IB/Iwarp card to dispatch an interrupt on a mapping 'miss' > > > > The point of requiring that the memory be pinned is so that > the IB/iWARP card does not have to deal with the kernel on > a per-placement basis. 
> > That includes having to double-check any host memory resources > to see if there is anything to 'miss' in the mapping. I guess I wasn't implying any 'double-checking'.. What I want is for the kernel to be able to unpin memory and tell the card it did so, instead of being locked into never being able to move that memory around. This requires no host memory interaction. By doing this, I can register a whole lot *more* memory, and the kernel can still keep buggy applications from trashing the whole system. [snip] > Fundamental to any definition of RDMA is that the application > controls the avialability of target memory -- not the kernel. > That is why traditional buffer flow controls do not apply. While hardware designers may like this idea, I would like to make the point that if you want the application to *absolutely* control the availability of physical memory, you shouldn't be writing userspace applications that run on Linux. There's always going to be a limit on how much memory you can mlock. And right now the only option the kernel has for unlocking that memory is to kill the application. I think there's got to be a reasonable way to deal with this that doesn't make the application responsible for everything in the world. We don't want to have to rewrite every RDMA application to be able to support memory hotplug. This is an obvious layer that can and should be abstracted by the kernel. From mshefty at ichips.intel.com Mon Mar 14 16:22:58 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Mar 2005 16:22:58 -0800 Subject: [openib-general] [PATCH] [MAD] API changes and updates to support RMPP Message-ID: <20050314162258.1bedff07.mshefty@ichips.intel.com> This patch updates the MAD API to help provide support for the RMPP implementation and clients. Notable changes: * A valid memory region (MR) is returned as part of the mad_agent registration process. The agent, CM, and SA query modules were updated to use the returned MR. 
* A list_head structure was added to ib_mad_recv_wc to make walking the list of received MAD buffers easier. As part of this change, a bug was fixed where freed memory could have been accessed in ib_free_recv_mad() if RMPP were enabled. This change is unlikely to affect existing clients. Please respond with any comments. The received RMPP support code* is currently dependent on these changes. Signed-off-by: Sean Hefty *not included, some assembly required... Index: core/agent.c =================================================================== --- core/agent.c (revision 1964) +++ core/agent.c (working copy) @@ -135,7 +135,7 @@ static int agent_mad_send(struct ib_mad_ sizeof(mad_priv->mad), DMA_TO_DEVICE); gather_list.length = sizeof(mad_priv->mad); - gather_list.lkey = (*port_priv->mr).lkey; + gather_list.lkey = mad_agent->mr->lkey; send_wr.next = NULL; send_wr.opcode = IB_WR_SEND; @@ -324,22 +324,12 @@ int ib_agent_port_open(struct ib_device goto error3; } - port_priv->mr = ib_get_dma_mr(port_priv->smp_agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(port_priv->mr)) { - printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); - ret = PTR_ERR(port_priv->mr); - goto error4; - } - spin_lock_irqsave(&ib_agent_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_agent_port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); return 0; -error4: - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); error3: ib_unregister_mad_agent(port_priv->smp_agent); error2: @@ -363,8 +353,6 @@ int ib_agent_port_close(struct ib_device list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - ib_dereg_mr(port_priv->mr); - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->smp_agent); kfree(port_priv); Index: core/cm.c =================================================================== --- core/cm.c (revision 1977) +++ core/cm.c (working copy) @@ -75,7 +75,6 @@ static struct ib_cm { struct cm_port 
{ struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; - struct ib_mr *mr; u8 port_num; }; @@ -191,7 +190,7 @@ static int cm_alloc_msg(struct cm_id_pri DMA_TO_DEVICE); pci_unmap_addr_set(m, mapping, m->sge.addr); m->sge.length = sizeof m->mad; - m->sge.lkey = cm_id_priv->av.port->mr->lkey; + m->sge.lkey = cm_id_priv->av.port->mad_agent->mr->lkey; m->send_wr.wr_id = (unsigned long) m; m->send_wr.sg_list = &m->sge; @@ -2970,14 +2969,9 @@ static void cm_add_one(struct ib_device if (IS_ERR(port->mad_agent)) goto error2; - port->mr = ib_get_dma_mr(port->mad_agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(port->mr)) - goto error3; - ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error4; + goto error3; } ib_set_client_data(device, &cm_client, cm_dev); @@ -2986,15 +2980,13 @@ static void cm_add_one(struct ib_device write_unlock_irqrestore(&cm.device_lock, flags); return; -error4: - ib_dereg_mr(port->mr); error3: ib_unregister_mad_agent(port->mad_agent); error2: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { - port = &cm_dev->port[i]; + port = &cm_dev->port[i-1]; ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } @@ -3022,7 +3014,6 @@ static void cm_remove_one(struct ib_devi for (i = 1; i <= device->phys_port_cnt; i++) { port = &cm_dev->port[i-1]; - ib_dereg_mr(port->mr); ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } Index: core/mad.c =================================================================== --- core/mad.c (revision 1980) +++ core/mad.c (working copy) @@ -35,8 +35,6 @@ #include #include -#include - #include "mad_priv.h" #include "smi.h" #include "agent.h" @@ -264,22 +262,29 @@ struct ib_mad_agent *ib_register_mad_age ret = ERR_PTR(-ENOMEM); goto error1; } + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + + mad_agent_priv->agent.mr = ib_get_dma_mr(port_priv->qp_info[qpn].qp->pd, 
+ IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(mad_agent_priv->agent.mr)) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } if (mad_reg_req) { reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); if (!reg_req) { ret = ERR_PTR(-ENOMEM); - goto error2; + goto error3; } /* Make a copy of the MAD registration request */ memcpy(reg_req, mad_reg_req, sizeof *reg_req); } /* Now, fill in the various structures */ - memset(mad_agent_priv, 0, sizeof *mad_agent_priv); mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; - mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.rmpp_version = rmpp_version; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -304,7 +309,7 @@ struct ib_mad_agent *ib_register_mad_age if (method) { if (method_in_use(&method, mad_reg_req)) - goto error3; + goto error4; } } ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, @@ -320,14 +325,14 @@ struct ib_mad_agent *ib_register_mad_age if (is_vendor_method_in_use( vendor_class, mad_reg_req)) - goto error3; + goto error4; } } ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); } if (ret2) { ret = ERR_PTR(ret2); - goto error3; + goto error4; } } @@ -349,11 +354,13 @@ struct ib_mad_agent *ib_register_mad_age return &mad_agent_priv->agent; -error3: +error4: spin_unlock_irqrestore(&port_priv->reg_lock, flags); kfree(reg_req); -error2: +error3: kfree(mad_agent_priv); +error2: + ib_dereg_mr(mad_agent_priv->agent.mr); error1: return ret; } @@ -490,18 +497,16 @@ static void unregister_mad_agent(struct * MADs, preventing us from queuing additional work */ cancel_mads(mad_agent_priv); - port_priv = mad_agent_priv->qp_info->port_priv; - cancel_delayed_work(&mad_agent_priv->timed_work); - flush_workqueue(port_priv->wq); spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); list_del(&mad_agent_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, 
flags); - /* XXX: Cleanup pending RMPP receives for this agent */ + flush_workqueue(port_priv->wq); + /* ib_cancel_rmpp_recvs(mad_agent_priv); */ atomic_dec(&mad_agent_priv->refcount); wait_event(mad_agent_priv->wait, @@ -509,6 +514,7 @@ static void unregister_mad_agent(struct if (mad_agent_priv->reg_req) kfree(mad_agent_priv->reg_req); + ib_dereg_mr(mad_agent_priv->agent.mr); kfree(mad_agent_priv); } @@ -757,7 +763,7 @@ static int handle_outgoing_dr_smp(struct list_add_tail(&local->completion_list, &mad_agent_priv->local_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); queue_work(mad_agent_priv->qp_info->port_priv->wq, - &mad_agent_priv->local_work); + &mad_agent_priv->local_work); ret = 1; out: return ret; @@ -919,31 +925,33 @@ EXPORT_SYMBOL(ib_post_send_mad); */ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_mad_recv_buf *entry; + struct ib_mad_recv_buf *mad_recv_buf, *temp_recv_buf; struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *priv; + struct list_head free_list; - mad_priv_hdr = container_of(mad_recv_wc, - struct ib_mad_private_header, - recv_wc); - priv = container_of(mad_priv_hdr, struct ib_mad_private, header); - - /* - * Walk receive buffer list associated with this WC - * No need to remove them from list of receive buffers - */ - list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { - /* Free previous receive buffer */ - kmem_cache_free(ib_mad_cache, priv); + if (mad_recv_wc->mad_len <= sizeof(struct ib_mad)) { mad_priv_hdr = container_of(mad_recv_wc, struct ib_mad_private_header, recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); - } + kmem_cache_free(ib_mad_cache, priv); + } else { + INIT_LIST_HEAD(&free_list); + list_splice_init(&mad_recv_wc->rmpp_list, &free_list); - /* Free last buffer */ - kmem_cache_free(ib_mad_cache, priv); + list_for_each_entry_safe(mad_recv_buf, temp_recv_buf, + &free_list, list) { + mad_priv_hdr = container_of(mad_recv_wc, + struct 
ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, + struct ib_mad_private, + header); + kmem_cache_free(ib_mad_cache, priv); + } + } } EXPORT_SYMBOL(ib_free_recv_mad); @@ -1486,16 +1494,19 @@ out: return valid; } -/* - * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet - */ -static struct ib_mad_private * -reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_private *recv) -{ - /* Until we have RMPP, all receives are reassembled!... */ - INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); - return recv; +static struct ib_mad_recv_wc * +process_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_recv_wc *mad_recv_wc) +{ + INIT_LIST_HEAD(&mad_recv_wc->rmpp_list); + list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list); + + /* + if (mad_agent_priv->agent.rmpp_version) + return ib_process_rmpp_recv(mad_agent_priv, mad_recv_wc); + else + */ + return mad_recv_wc; } static struct ib_mad_send_wr_private* @@ -1526,16 +1537,17 @@ find_send_req(struct ib_mad_agent_privat } static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_private *recv, + struct ib_mad_recv_wc *mad_recv_wc, int solicited) { struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_send_wc mad_send_wc; unsigned long flags; + u64 tid; - /* Fully reassemble receive before processing */ - recv = reassemble_recv(mad_agent_priv, recv); - if (!recv) { + /* Process the receive before giving it to the user. 
*/ + mad_recv_wc = process_recv(mad_agent_priv, mad_recv_wc); + if (!mad_recv_wc) { if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); return; @@ -1543,12 +1555,12 @@ static void ib_mad_complete_recv(struct /* Complete corresponding request */ if (solicited) { + tid = mad_recv_wc->recv_buf.mad->mad_hdr.tid; spin_lock_irqsave(&mad_agent_priv->lock, flags); - mad_send_wr = find_send_req(mad_agent_priv, - recv->mad.mad.mad_hdr.tid); + mad_send_wr = find_send_req(mad_agent_priv, tid); if (!mad_send_wr) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - ib_free_recv_mad(&recv->header.recv_wc); + ib_free_recv_mad(mad_recv_wc); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); return; @@ -1558,10 +1570,9 @@ static void ib_mad_complete_recv(struct spin_unlock_irqrestore(&mad_agent_priv->lock, flags); /* Defined behavior is to complete response before request */ - recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; - mad_agent_priv->agent.recv_handler( - &mad_agent_priv->agent, - &recv->header.recv_wc); + mad_recv_wc->wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, + mad_recv_wc); atomic_dec(&mad_agent_priv->refcount); mad_send_wc.status = IB_WC_SUCCESS; @@ -1569,9 +1580,8 @@ static void ib_mad_complete_recv(struct mad_send_wc.wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); } else { - mad_agent_priv->agent.recv_handler( - &mad_agent_priv->agent, - &recv->header.recv_wc); + mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, + mad_recv_wc); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); } @@ -1675,7 +1685,8 @@ local: solicited = solicited_mad(&recv->mad.mad); mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); if (mad_agent) { - ib_mad_complete_recv(mad_agent, recv, solicited); + ib_mad_complete_recv(mad_agent, &recv->header.recv_wc, + solicited); /* * recv is freed up 
in error cases in ib_mad_complete_recv * or via recv_handler in ib_mad_complete_recv() @@ -1757,10 +1768,18 @@ static void ib_mad_complete_send_wr(stru { struct ib_mad_agent_private *mad_agent_priv; unsigned long flags; + enum ib_mad_result ret; mad_agent_priv = container_of(mad_send_wr->agent, struct ib_mad_agent_private, agent); + /* + if (mad_agent_priv->agent.rmpp_version) + ret = process_rmpp_send_wc(mad_send_wr, mad_send_wc); + else + */ + ret = IB_MAD_RESULT_SUCCESS; + spin_lock_irqsave(&mad_agent_priv->lock, flags); if (mad_send_wc->status != IB_WC_SUCCESS && mad_send_wr->status == IB_WC_SUCCESS) { @@ -1784,8 +1803,9 @@ static void ib_mad_complete_send_wr(stru if (mad_send_wr->status != IB_WC_SUCCESS ) mad_send_wc->status = mad_send_wr->status; - mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - mad_send_wc); + if (ret == IB_MAD_RESULT_SUCCESS) + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); /* Release reference on agent taken when sending */ if (atomic_dec_and_test(&mad_agent_priv->refcount)) @@ -2034,8 +2054,7 @@ void cancel_sends(void *data) &mad_send_wc); kfree(mad_send_wr); - if (atomic_dec_and_test(&mad_agent_priv->refcount)) - wake_up(&mad_agent_priv->wait); + atomic_dec(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->lock, flags); } spin_unlock_irqrestore(&mad_agent_priv->lock, flags); Index: core/agent_priv.h =================================================================== --- core/agent_priv.h (revision 1964) +++ core/agent_priv.h (working copy) @@ -57,7 +57,6 @@ struct ib_agent_port_private { int port_num; struct ib_mad_agent *smp_agent; /* SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ - struct ib_mr *mr; }; #endif /* __IB_AGENT_PRIV_H__ */ Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1980) +++ core/mad_priv.h (working copy) @@ -101,7 +101,6 @@ struct ib_mad_agent_private { atomic_t 
refcount; wait_queue_head_t wait; - u8 rmpp_version; }; struct ib_mad_snoop_private { Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 1964) +++ core/sa_query.c (working copy) @@ -77,7 +77,6 @@ struct ib_sa_sm_ah { struct ib_sa_port { struct ib_mad_agent *agent; - struct ib_mr *mr; struct ib_sa_sm_ah *sm_ah; struct work_struct update_task; spinlock_t ah_lock; @@ -492,7 +491,7 @@ retry: sizeof (struct ib_sa_mad), DMA_TO_DEVICE); gather_list.length = sizeof (struct ib_sa_mad); - gather_list.lkey = port->mr->lkey; + gather_list.lkey = port->agent->mr->lkey; pci_unmap_addr_set(query, mapping, gather_list.addr); ret = ib_post_send_mad(port->agent, &wr, &bad_wr); @@ -771,7 +770,6 @@ static void ib_sa_add_one(struct ib_devi sa_dev->end_port = e; for (i = 0; i <= e - s; ++i) { - sa_dev->port[i].mr = NULL; sa_dev->port[i].sm_ah = NULL; sa_dev->port[i].port_num = i + s; spin_lock_init(&sa_dev->port[i].ah_lock); @@ -783,13 +781,6 @@ static void ib_sa_add_one(struct ib_devi if (IS_ERR(sa_dev->port[i].agent)) goto err; - sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(sa_dev->port[i].mr)) { - ib_unregister_mad_agent(sa_dev->port[i].agent); - goto err; - } - INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah, &sa_dev->port[i]); } @@ -813,10 +804,8 @@ static void ib_sa_add_one(struct ib_devi return; err: - while (--i >= 0) { - ib_dereg_mr(sa_dev->port[i].mr); + while (--i >= 0) ib_unregister_mad_agent(sa_dev->port[i].agent); - } kfree(sa_dev); Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1964) +++ include/ib_mad.h (working copy) @@ -70,9 +70,37 @@ #define IB_MGMT_MAX_METHODS 128 +/* RMPP information */ +#define IB_MGMT_RMPP_VERSION 1 + +#define IB_MGMT_RMPP_TYPE_DATA 1 +#define IB_MGMT_RMPP_TYPE_ACK 2 +#define IB_MGMT_RMPP_TYPE_STOP 3 +#define IB_MGMT_RMPP_TYPE_ABORT 4 + 
+#define IB_MGMT_RMPP_FLAG_ACTIVE 1 +#define IB_MGMT_RMPP_FLAG_FIRST (1<<1) +#define IB_MGMT_RMPP_FLAG_LAST (1<<2) + +#define IB_MGMT_RMPP_NO_RESPTIME 0x1F + +#define IB_MGMT_RMPP_STATUS_SUCCESS 0 +#define IB_MGMT_RMPP_STATUS_RESX 1 +#define IB_MGMT_RMPP_STATUS_T2L 118 +#define IB_MGMT_RMPP_STATUS_BAD_LEN 119 +#define IB_MGMT_RMPP_STATUS_BAD_SEG 120 +#define IB_MGMT_RMPP_STATUS_BADT 121 +#define IB_MGMT_RMPP_STATUS_W2S 122 +#define IB_MGMT_RMPP_STATUS_S2B 123 +#define IB_MGMT_RMPP_STATUS_BAD_STATUS 124 +#define IB_MGMT_RMPP_STATUS_UNV 125 +#define IB_MGMT_RMPP_STATUS_TMR 126 +#define IB_MGMT_RMPP_STATUS_UNSPEC 127 + #define IB_QP0 0 #define IB_QP1 __constant_htonl(1) #define IB_QP1_QKEY 0x80010000 +#define IB_QP_SET_QKEY 0x80000000 struct ib_grh { u32 version_tclass_flow; @@ -124,6 +152,45 @@ struct ib_vendor_mad { u8 data[216]; } __attribute__ ((packed)); +/** + * ib_get_rmpp_resptime - Returns the RMPP response time. + * @rmpp_hdr: An RMPP header. + */ +static inline u8 ib_get_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr) +{ + return rmpp_hdr->rmpp_rtime_flags >> 3; +} + +/** + * ib_get_rmpp_flags - Returns the RMPP flags. + * @rmpp_hdr: An RMPP header. + */ +static inline u8 ib_get_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr) +{ + return rmpp_hdr->rmpp_rtime_flags & 0x7; +} + +/** + * ib_set_rmpp_resptime - Sets the response time in an RMPP header. + * @rmpp_hdr: An RMPP header. + * @rtime: The response time to set. + */ +static inline void ib_set_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr, u8 rtime) +{ + rmpp_hdr->rmpp_rtime_flags = ib_get_rmpp_flags(rmpp_hdr) | (rtime << 3); +} + +/** + * ib_set_rmpp_flags - Sets the flags in an RMPP header. + * @rmpp_hdr: An RMPP header. + * @flags: The flags to set. 
+ */ +static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags) +{ + rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) | + (flags & 0x7); +} + struct ib_mad_agent; struct ib_mad_send_wc; struct ib_mad_recv_wc; @@ -168,6 +235,7 @@ typedef void (*ib_mad_recv_handler)(stru * ib_mad_agent - Used to track MAD registration with the access layer. * @device: Reference to device registration is on. * @qp: Reference to QP used for sending and receiving MADs. + * @mr: Memory region for system memory usable for DMA. * @recv_handler: Callback handler for a received MAD. * @send_handler: Callback handler for a sent MAD. * @snoop_handler: Callback handler for snooped sent MADs. @@ -176,16 +244,19 @@ typedef void (*ib_mad_recv_handler)(stru * Unsolicited MADs sent by this client will have the upper 32-bits * of their TID set to this value. * @port_num: Port number on which QP is registered + * @rmpp_version: If set, indicates the RMPP version used by this agent. */ struct ib_mad_agent { struct ib_device *device; struct ib_qp *qp; + struct ib_mr *mr; ib_mad_recv_handler recv_handler; ib_mad_send_handler send_handler; ib_mad_snoop_handler snoop_handler; void *context; u32 hi_tid; u8 port_num; + u8 rmpp_version; }; /** @@ -219,6 +290,7 @@ struct ib_mad_recv_buf { * ib_mad_recv_wc - received MAD information. * @wc: Completion information for the received data. * @recv_buf: Specifies the location of the received data buffer(s). + * @rmpp_list: Specifies a list of RMPP reassembled received MAD buffers. * @mad_len: The length of the received MAD, without duplicated headers. 
* * For received response, the wr_id field of the wc is set to the wr_id @@ -227,6 +299,7 @@ struct ib_mad_recv_buf { struct ib_mad_recv_wc { struct ib_wc *wc; struct ib_mad_recv_buf recv_buf; + struct list_head rmpp_list; int mad_len; }; From caitlinb at siliquent.com Mon Mar 14 16:33:19 2005 From: caitlinb at siliquent.com (Caitlin Bestler) Date: Mon, 14 Mar 2005 16:33:19 -0800 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <8508251A6FC08A489844A94261D3693A039009@fiona.siliquent.com> > > While hardware designers may like this idea, I would like to > make the point that if you want the application to > *absolutely* control the availability of physical memory, you > shouldn't be writing userspace applications that run on Linux. > This is not just a hardware design issue. It is fundamental to why RDMA is able to optimize end-to-end traffic flow. The application is directly advertising the availability of buffers (through RKeys) to the other side. It is bad network engineering for the kernel to revoke that good faith advertisement and count on the HCA/RNIC to say "oops" when the data does arrive but the targeted buffer is not in memory. But that does not mean that you cannot design mechanisms below the application to allow the kernel to re-organize physical memory -- it just means that the kernel had best not be playing overcommit tricks behind the applications back. To use a banking analogy, an adverised RKey is like a certified check. The application has sent this RKey to its peer, and it expects the HCA/RNIC to honor that check when RDMA Writes are made to that memory. But just as a bank does not have to guarantee in advance which specific bills will be used to cash a guaranteed check, there is nothing to say that the virtual to physical mappings are permanent and immutable. It would be possible to design an interface that allowed the kernel to: a) suspend the use of a memory region. 
   1) outputs referencing the suspended LKey would be
      temporarily held by the HCA/RNIC.
   2) inputs referencing the suspended memory region
      would be delayed (RNR NAK, internal buffers,
      etc.)
   3) possibly ask the peer to similarly suspend
      sending. This is trickier though.
b) Update the virtual to physical mappings, or at least
   provide the RDMA layer with "physical page X replaced
   by physical page Y".
c) unsuspend the memory region.

The key is that the entire operation either has to be fast
enough so that no connection or application session layer
time-outs occur, or an end-to-end agreement to suspend the
connection is a requirement. The first option seems more
plausible to me; the second essentially requires extending
the CM protocol. That's a tall order even for InfiniBand,
and it's even worse for iWARP where the CM functionality
typically ends when the connection is established.

> There's always going to be a limit on how much memory you can
> mlock. And right now the only option the kernel has for
> unlocking that memory is to kill the application. I think
> there's got to be a reasonable way to deal with this that
> doesn't make the application responsible for everything in
> the world. We don't want to have to rewrite every RDMA
> application to be able to support memory hotplug. This is an
> obvious layer that can and should be abstracted by the kernel.
>

Yes, there are limits on how much memory you can mlock, or
even allocate. Applications are required to register memory
precisely because the required guarantees are not there by
default. Eliminating those guarantees *is* effectively
rewriting every RDMA application without even letting
them know.
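The suspend / update / unsuspend sequence proposed above can be sketched as a small state machine. This is only an illustration under stated assumptions: every name below (sketch_mr, sketch_mr_suspend, and so on) is hypothetical and does not exist in OpenIB, the kernel, or any HCA/RNIC interface, and "holding" an access is modeled by a simple counter rather than by real RNR NAKs or adapter buffering.

```c
/*
 * Hypothetical sketch only: none of these names are part of OpenIB,
 * the Linux kernel, or any adapter interface.
 */
#include <stdint.h>

enum sketch_mr_state { SKETCH_MR_ACTIVE, SKETCH_MR_SUSPENDED };

enum sketch_result {
	SKETCH_OK,	/* access may proceed against the current mapping */
	SKETCH_DELAYED,	/* region suspended: hold the output or delay the input */
	SKETCH_EINVAL	/* operation is not valid in the current state */
};

#define SKETCH_MR_PAGES 4

struct sketch_mr {
	enum sketch_mr_state state;
	uint64_t phys_page[SKETCH_MR_PAGES];	/* virtual-to-physical mapping */
	unsigned int delayed;			/* accesses held while suspended */
};

static void sketch_mr_init(struct sketch_mr *mr, uint64_t first_page)
{
	unsigned int i;

	mr->state = SKETCH_MR_ACTIVE;
	mr->delayed = 0;
	for (i = 0; i < SKETCH_MR_PAGES; i++)
		mr->phys_page[i] = first_page + i;
}

/* Step a): suspend the region. The advertised key stays valid. */
static enum sketch_result sketch_mr_suspend(struct sketch_mr *mr)
{
	if (mr->state != SKETCH_MR_ACTIVE)
		return SKETCH_EINVAL;
	mr->state = SKETCH_MR_SUSPENDED;
	return SKETCH_OK;
}

/* Step b): "physical page X replaced by physical page Y". */
static enum sketch_result sketch_mr_replace_page(struct sketch_mr *mr,
						 uint64_t old_page,
						 uint64_t new_page)
{
	unsigned int i;

	/* Remapping is only allowed while no access can be in flight. */
	if (mr->state != SKETCH_MR_SUSPENDED)
		return SKETCH_EINVAL;

	for (i = 0; i < SKETCH_MR_PAGES; i++) {
		if (mr->phys_page[i] == old_page) {
			mr->phys_page[i] = new_page;
			return SKETCH_OK;
		}
	}
	return SKETCH_EINVAL;
}

/* Step c): unsuspend. Returns how many held accesses must be replayed. */
static unsigned int sketch_mr_resume(struct sketch_mr *mr)
{
	unsigned int replay;

	if (mr->state != SKETCH_MR_SUSPENDED)
		return 0;
	replay = mr->delayed;
	mr->delayed = 0;
	mr->state = SKETCH_MR_ACTIVE;
	return replay;
}

/* Models a local (LKey) or remote (RKey) access reaching the adapter. */
static enum sketch_result sketch_mr_access(struct sketch_mr *mr)
{
	if (mr->state == SKETCH_MR_SUSPENDED) {
		mr->delayed++;
		return SKETCH_DELAYED;
	}
	return SKETCH_OK;
}
```

A caller would run sketch_mr_suspend(), one or more sketch_mr_replace_page() calls, then sketch_mr_resume(). The sketch deliberately leaves out the hard part raised in the message: the whole window has to close before any connection or session-layer timeout fires.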
From hozer at hozed.org Mon Mar 14 17:06:19 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 14 Mar 2005 19:06:19 -0600
Subject: [openib-general] Getting rid of pinned memory requirement
In-Reply-To: <8508251A6FC08A489844A94261D3693A039009@fiona.siliquent.com>
References: <8508251A6FC08A489844A94261D3693A039009@fiona.siliquent.com>
Message-ID: <20050315010619.GT9768@kalmia.hozed.org>

On Mon, Mar 14, 2005 at 04:33:19PM -0800, Caitlin Bestler wrote:
> >
> > While hardware designers may like this idea, I would like to
> > make the point that if you want the application to
> > *absolutely* control the availability of physical memory, you
> > shouldn't be writing userspace applications that run on Linux.
> >
>
> This is not just a hardware design issue. It is fundamental to
> why RDMA is able to optimize end-to-end traffic flow. The application
> is directly advertising the availability of buffers (through RKeys)
> to the other side. It is bad network engineering for the kernel
> to revoke that good faith advertisement and count on the HCA/RNIC
> to say "oops" when the data does arrive but the targeted buffer
> is not in memory.
>
> But that does not mean that you cannot design mechanisms below
> the application to allow the kernel to re-organize physical
> memory -- it just means that the kernel had best not be playing
> overcommit tricks behind the application's back.
>
> To use a banking analogy, an advertised RKey is like a certified
> check. The application has sent this RKey to its peer, and it
> expects the HCA/RNIC to honor that check when RDMA Writes are
> made to that memory. But just as a bank does not have to
> guarantee in advance which specific bills will be used to
> cash a guaranteed check, there is nothing to say that the
> virtual to physical mappings are permanent and immutable.
>
> It would be possible to design an interface that allowed
> the kernel to:
>
> a) suspend the use of a memory region.
>    1) outputs referencing the suspended LKey would be
>       temporarily held by the HCA/RNIC.
>    2) inputs referencing the suspended memory region
>       would be delayed (RNR NAK, internal buffers,
>       etc.)
>    3) possibly ask the peer to similarly suspend
>       sending. This is trickier though.
> b) Update the virtual to physical mappings, or at least
>    provide the RDMA layer with "physical page X replaced
>    by physical page Y".
> c) unsuspend the memory region.
>
> The key is that the entire operation either has to be fast
> enough so that no connection or application session layer
> time-outs occur, or an end-to-end agreement to suspend the
> connection is a requirement. The first option seems more
> plausible to me, the second essentially requires extending
> the CM protocol. That's a tall order even for InfiniBand,
> and it's even worse for iWARP where the CM functionality
> typically ends when the connection is established.

I'll buy the good network design argument.

I suppose if the kernel wants to revoke a card's pinned memory, we
should be able to guarantee that it gets new pinned memory within a
bounded time. What sort of timing do we need? Milliseconds?
Microseconds?

In the case of iWARP, isn't this just TCP underneath? If so, can't we
just drop any packets in the pipe on the floor and let them get
retransmitted? (I suppose the same argument goes for InfiniBand. What
sort of a time window do we have for retransmission?)

What are the limits on end-to-end flow control in IB and iWARP?

>
> >
> > There's always going to be a limit on how much memory you can
> > mlock. And right now the only option the kernel has for
> > unlocking that memory is to kill the application. I think
> > there's got to be a reasonable way to deal with this that
> > doesn't make the application responsible for everything in
> > the world. We don't want to have to rewrite every RDMA
> > application to be able to support memory hotplug.
> > This is an obvious layer that can and should be abstracted by
> > the kernel.
> >
>
> Yes, there are limits on how much memory you can mlock, or
> even allocate. Applications are required to register memory
> precisely because the required guarantees are not there by
> default. Eliminating those guarantees *is* effectively
> rewriting every RDMA application without even letting
> them know.

Some of this argument is a policy issue, which I would argue shouldn't
be hard-coded in the code or in the network hardware.

At least in my view, the guarantees are only there to make
applications go fast. We are getting low latency and high performance
with InfiniBand by making memory registration go really, really slow.
If, to make big HPC simulation applications work, we wind up doing
memcpy() to put the data into a registered buffer because we can't
register half of physical memory, the application isn't going very
fast.

From caitlinb at siliquent.com Mon Mar 14 17:35:31 2005
From: caitlinb at siliquent.com (Caitlin Bestler)
Date: Mon, 14 Mar 2005 17:35:31 -0800
Subject: [openib-general] Getting rid of pinned memory requirement
Message-ID: <8508251A6FC08A489844A94261D3693A03900B@fiona.siliquent.com>

> -----Original Message-----
> From: Troy Benjegerdes [mailto:hozer at hozed.org]
> Sent: Monday, March 14, 2005 5:06 PM
> To: Caitlin Bestler
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Getting rid of pinned memory requirement
>
> >
> > The key is that the entire operation either has to be fast
> > enough so that no connection or application session layer
> > time-outs occur, or an end-to-end agreement to suspend the
> > connection is a requirement. The first option seems more
> > plausible to me, the second essentially
> > requires extending the CM protocol. That's a tall order even for
> > InfiniBand, and it's even worse for iWARP where the CM
> > functionality typically ends when the connection is established.
>
> I'll buy the good network design argument.
>
> I suppose if the kernel wants to revoke a card's pinned
> memory, we should be able to guarantee that it gets new
> pinned memory within a bounded time. What sort of timing do
> we need? Milliseconds? Microseconds?
>
> In the case of iWarp, isn't this just TCP underneath? If so,
> can't we just drop any packets in the pipe on the floor and
> let them get retransmitted? (I suppose the same argument goes
> for infiniband.. what sort of a time window do we have for
> retransmission?)
>
> What are the limits on end-to-end flow control in IB and iWarp?
>

From the RDMA Provider's perspective, the short answer is "quick
enough so that I don't have to do anything heroic to keep the
connection alive."

With TCP you also have to add "and healthy". If you've ever had a
long download that got effectively stalled by a burst of noise, and
you just hit the 'reload' button on your browser, then you know what
I'm talking about.

But in transport-neutral terms I would think that one RTT is
definitely safe -- that much data could have been dropped by one
switch failure or one nasty spike in inbound noise.

> >
> > Yes, there are limits on how much memory you can mlock, or even
> > allocate. Applications are required to register memory precisely
> > because the required guarantees are not there by default. Eliminating
> > those guarantees *is* effectively rewriting every RDMA application
> > without even letting them know.
>
> Some of this argument is a policy issue, which I would argue
> shouldn't be hard-coded in the code or in the network hardware.
>
> At least in my view, the guarantees are only there to make
> applications go fast. We are getting low latency and high
> performance with infiniband by making memory registration go
> really really slow.
> If, to make big HPC simulation
> applications work, we wind up doing memcpy() to put the data
> into a registered buffer because we can't register half of
> physical memory, the application isn't going very fast.
>

What you are looking for is a distinction between registering memory
to *enable* the RNIC to optimize local access and registering memory
to enable its being advertised to the remote end.

Early implementations of RDMA, both IB and iWARP, have not
distinguished between the two. But theoretically *applications* do
not need memory regions that are not enabled for remote access to be
pinned. That is an RNIC requirement that could evolve.

But applications themselves *do* need remotely accessible memory
regions, portions of which they intend to advertise with RKeys, to be
truly available (i.e., pinned).

You are also making a policy assumption that an application that
actually needs half of physical memory should be using paged memory.
Memory is cheap, and if performance is critical why should this
memory be swapped out to disk?

Is the limitation on not being able to register half of physical
memory based upon some assumption that swapping is a requirement? Or
is it a limitation in the memory region size? If it's the latter, you
need to get the OS to support larger page sizes.

From abhijitngpune at indiatimes.com Mon Mar 14 21:32:34 2005
From: abhijitngpune at indiatimes.com (abhijitngpune)
Date: Tue, 15 Mar 2005 11:02:34 +0530
Subject: [openib-general] openSM
Message-ID: <200503150518.KAA03171@WS0005.indiatimes.com>

Hi,

Does openSM support non-fat-tree (irregular, such as graph)
topologies?

Abhijeet

From mst at mellanox.co.il Mon Mar 14 21:55:42 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 15 Mar 2005 07:55:42 +0200
Subject: [openib-general] [PATCH] uverbs rdma example (updated)
In-Reply-To: <52u0nexaot.fsf@topspin.com>
References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> <52hdjeys89.fsf@topspin.com> <20050314191011.GD17668@mellanox.co.il> <52u0nexaot.fsf@topspin.com>
Message-ID: <20050315055542.GA18928@mellanox.co.il>

Quoting r. Roland Dreier:
> Subject: Re: [openib-general] [PATCH] uverbs rdma example (updated)
>
>     Michael> Hmm. rx_depth is unused. tx_depth is used.
>
> Where? If I search through the whole patch for "tx_depth," the only
> place I see it do anything at all is in
>
> + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL);

It also sets the qp depth; I'm not sure why you don't see it.
Attached please find my latest version of the test.

> but I don't see how more than one send can be outstanding.

Only one may be outstanding at a time, but the tx_depth option makes
it possible to study the effect of qp/cq size on the latency.

mst

--
MST - Michael S. Tsirkin
-------------- next part --------------
/*
 * Copyright (c) 2005 Topspin Communications.  All rights reserved.
 * Copyright (c) 2005 Mellanox Technologies Ltd.  All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.  You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 *     Redistribution and use in source and binary forms, with or
 *     without modification, are permitted provided that the following
 *     conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer in the documentation and/or other materials
 *        provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 * $Id$
 */

#if HAVE_CONFIG_H
#  include
#endif /* HAVE_CONFIG_H */

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

enum {
	PINGPONG_RDMA_WRID = 3,
};

static int page_size;

struct pingpong_context {
	struct ibv_context *context;
	struct ibv_pd      *pd;
	struct ibv_mr      *mr;
	struct ibv_cq      *cq;
	struct ibv_qp      *qp;
	void               *buf;
	volatile char      *post_buf;
	volatile char      *poll_buf;
	int                 size;
	int                 rx_depth;
	int                 tx_depth;
	struct ibv_sge      list;
	struct ibv_send_wr  wr;
};

struct pingpong_dest {
	int lid;
	int qpn;
	int psn;
	unsigned rkey;
	unsigned long long vaddr;
};

/*
 * pp_get_local_lid() uses a pretty bogus method for finding the LID
 * of a local port.  Please don't copy this into your app (or if you
 * do, please rip it out soon).
*/ static uint16_t pp_get_local_lid(struct ibv_device *dev, int port) { char path[256]; char val[16]; char *name; if (sysfs_get_mnt_path(path, sizeof path)) { fprintf(stderr, "Couldn't find sysfs mount.\n"); return 0; } asprintf(&name, "%s/class/infiniband/%s/ports/%d/lid", path, ibv_get_device_name(dev), port); if (sysfs_read_attribute_value(name, val, sizeof val)) { fprintf(stderr, "Couldn't read LID at %s\n", name); return 0; } return strtol(val, NULL, 0); } static int pp_client_connect(const char *servername, int port) { struct addrinfo *res, *t; struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM }; char *service; int n; int sockfd = -1; asprintf(&service, "%d", port); n = getaddrinfo(servername, service, &hints, &res); if (n < 0) { fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); return n; } for (t = res; t; t = t->ai_next) { sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) break; close(sockfd); sockfd = -1; } } freeaddrinfo(res); if (sockfd < 0) { fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); return sockfd; } return sockfd; } struct pingpong_dest * pp_client_exch_dest(int sockfd, const struct pingpong_dest *my_dest) { struct pingpong_dest *rem_dest = NULL; char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; int parsed; sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn,my_dest->rkey,my_dest->vaddr); if (write(sockfd, msg, sizeof msg) != sizeof msg) { perror("client write"); fprintf(stderr, "Couldn't send local address\n"); goto out; } if (read(sockfd, msg, sizeof msg) != sizeof msg) { perror("client read"); fprintf(stderr, "Couldn't read remote address\n"); goto out; } rem_dest = malloc(sizeof *rem_dest); if (!rem_dest) goto out; parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); if (parsed != 5) { 
fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, msg); free(rem_dest); rem_dest = NULL; goto out; } out: return rem_dest; } int pp_server_connect(int port) { struct addrinfo *res, *t; struct addrinfo hints = { .ai_flags = AI_PASSIVE, .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM }; char *service; int sockfd = -1, connfd; int n; asprintf(&service, "%d", port); n = getaddrinfo(NULL, service, &hints, &res); if (n < 0) { fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); return n; } for (t = res; t; t = t->ai_next) { sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { n = 1; setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) break; close(sockfd); sockfd = -1; } } freeaddrinfo(res); if (sockfd < 0) { fprintf(stderr, "Couldn't listen to port %d\n", port); return sockfd; } listen(sockfd, 1); connfd = accept(sockfd, NULL, 0); if (connfd < 0) { perror("server accept"); fprintf(stderr, "accept() failed\n"); close(sockfd); return connfd; } close(sockfd); return connfd; } static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) { char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; struct pingpong_dest *rem_dest = NULL; int parsed; int n; n = read(connfd, msg, sizeof msg); if (n != sizeof msg) { perror("server read"); fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); goto out; } rem_dest = malloc(sizeof *rem_dest); if (!rem_dest) goto out; parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn, &rem_dest->rkey, &rem_dest->vaddr); if (parsed != 5) { fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, msg); free(rem_dest); rem_dest = NULL; goto out; } sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn, my_dest->rkey, my_dest->vaddr); if (write(connfd, msg, sizeof msg) != sizeof msg) { perror("server write"); 
fprintf(stderr, "Couldn't send local address\n"); free(rem_dest); rem_dest = NULL; goto out; } out: return rem_dest; } static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, int tx_depth, int rx_depth, int port) { struct pingpong_context *ctx; ctx = malloc(sizeof *ctx); if (!ctx) return NULL; ctx->size = size; ctx->rx_depth = rx_depth; ctx->tx_depth = tx_depth; ctx->buf = memalign(page_size, size * 2); if (!ctx->buf) { fprintf(stderr, "Couldn't allocate work buf.\n"); return NULL; } memset(ctx->buf, 0, size * 2); ctx->post_buf = (char*)ctx->buf + (size - 1); ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); ctx->context = ibv_open_device(ib_dev); if (!ctx->context) { fprintf(stderr, "Couldn't get context for %s\n", ibv_get_device_name(ib_dev)); return NULL; } ctx->pd = ibv_alloc_pd(ctx->context); if (!ctx->pd) { fprintf(stderr, "Couldn't allocate PD\n"); return NULL; } ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, IBV_ACCESS_REMOTE_WRITE); if (!ctx->mr) { fprintf(stderr, "Couldn't allocate MR\n"); return NULL; } ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); if (!ctx->cq) { fprintf(stderr, "Couldn't create CQ\n"); return NULL; } { struct ibv_qp_init_attr attr = { .send_cq = ctx->cq, .recv_cq = ctx->cq, .cap = { .max_send_wr = tx_depth, .max_recv_wr = rx_depth, .max_send_sge = 1, .max_recv_sge = 1 }, .qp_type = IBV_QPT_RC }; ctx->qp = ibv_create_qp(ctx->pd, &attr); if (!ctx->qp) { fprintf(stderr, "Couldn't create QP\n"); return NULL; } } { struct ibv_qp_attr attr; attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = port; attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS)) { fprintf(stderr, "Failed to modify QP to INIT\n"); return NULL; } } ctx->wr.wr_id = PINGPONG_RDMA_WRID; ctx->wr.sg_list = &ctx->list; ctx->wr.num_sge = 1; ctx->wr.opcode = IBV_WR_RDMA_WRITE; ctx->wr.send_flags = IBV_SEND_SIGNALED; 
return ctx; } static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, struct pingpong_dest *dest) { struct ibv_qp_attr attr; attr.qp_state = IBV_QPS_RTR; attr.path_mtu = IBV_MTU_1024; attr.dest_qp_num = dest->qpn; attr.rq_psn = dest->psn; attr.max_dest_rd_atomic = 1; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.dlid = dest->lid; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.ah_attr.port_num = port; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER)) { fprintf(stderr, "Failed to modify QP to RTR\n"); return 1; } attr.qp_state = IBV_QPS_RTS; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = my_psn; attr.max_rd_atomic = 1; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { fprintf(stderr, "Failed to modify QP to RTS\n"); return 1; } return 0; } static void usage(const char *argv0) { printf("Usage:\n"); printf(" %s start a server and wait for connection\n", argv0); printf(" %s connect to server at \n", argv0); printf("\n"); printf("Options:\n"); printf(" -p, --port= listen on/connect to port (default 18515)\n"); printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); printf(" -s, --size= size of message to exchange (default 4096)\n"); printf(" -t, --tx-depth= size of tx queue (default 50)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); } int main(int argc, char *argv[]) { struct dlist *dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; struct pingpong_dest *rem_dest; struct timeval start, end; char *ib_devname = NULL; char *servername = NULL; int port = 18515; int ib_port = 1; int size = 1; int rx_depth = 1; int tx_depth = 50; int iters = 1000; 
int scnt, rcnt, ccnt; int client_first_post; int sockfd; struct ibv_qp *qp; struct ibv_send_wr *wr; volatile char *poll_buf; volatile char *post_buf; srand48(getpid() * time(NULL)); while (1) { int c; static struct option long_options[] = { { .name = "port", .has_arg = 1, .val = 'p' }, { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "tx-depth",.has_arg = 1, .val = 't' }, { 0 } }; c = getopt_long(argc, argv, "p:d:i:s:t:n:e", long_options, NULL); if (c == -1) break; switch (c) { case 'p': port = strtol(optarg, NULL, 0); if (port < 0 || port > 65535) { usage(argv[0]); return 1; } break; case 'd': ib_devname = strdupa(optarg); break; case 'i': ib_port = strtol(optarg, NULL, 0); if (ib_port < 0) { usage(argv[0]); return 1; } break; case 's': size = strtol(optarg, NULL, 0); break; case 't': tx_depth = strtol(optarg, NULL, 0); break; case 'n': iters = strtol(optarg, NULL, 0); break; default: usage(argv[0]); return 1; } } if (optind == argc - 1) servername = strdupa(argv[optind]); else if (optind < argc) { usage(argv[0]); return 1; } page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); dlist_start(dev_list); if (!ib_devname) { ib_dev = dlist_next(dev_list); if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { dlist_for_each_data(dev_list, ib_dev, struct ibv_device) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { fprintf(stderr, "IB device %s not found\n", ib_devname); return 1; } } ctx = pp_init_ctx(ib_dev, size, tx_depth, rx_depth, ib_port); if (!ctx) return 1; my_dest.lid = pp_get_local_lid(ib_dev, ib_port); my_dest.qpn = ctx->qp->qp_num; my_dest.psn = lrand48() & 0xffffff; if (!my_dest.lid) { fprintf(stderr, "Local lid 0x0 detected.
Is an SM running?\n"); return 1; } my_dest.rkey = ctx->mr->rkey; my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " "RKey %#08x VAddr %#016Lx\n", my_dest.lid, my_dest.qpn, my_dest.psn, my_dest.rkey, my_dest.vaddr); if (servername) { sockfd = pp_client_connect(servername, port); } else { sockfd = pp_server_connect(port); } if (sockfd < 0) return 1; if (servername) { rem_dest = pp_client_exch_dest(sockfd, &my_dest); } else { rem_dest = pp_server_exch_dest(sockfd, &my_dest); } if (!rem_dest) return 1; printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " "RKey %#08x VAddr %#016Lx\n", rem_dest->lid, rem_dest->qpn, rem_dest->psn, rem_dest->rkey, rem_dest->vaddr); if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; /* An additional handshake is required *after* moving qp to RTR. Arbitrarily reuse exch_dest for this purpose. */ if (servername) { rem_dest = pp_client_exch_dest(sockfd, &my_dest); } else { rem_dest = pp_server_exch_dest(sockfd, &my_dest); } write(sockfd, "done", sizeof "done"); close(sockfd); wr = &ctx->wr; ctx->list.addr = (uintptr_t) ctx->buf; ctx->list.length = ctx->size; ctx->list.lkey = ctx->mr->lkey; wr->wr.rdma.remote_addr = rem_dest->vaddr; wr->wr.rdma.rkey = rem_dest->rkey; scnt = 0; rcnt = 0; ccnt = 0; if (servername) client_first_post = 1; else client_first_post = 0; poll_buf = ctx->poll_buf; post_buf = ctx->post_buf; qp = ctx->qp; if (gettimeofday(&start, NULL)) { perror("gettimeofday"); return 1; } while (scnt < iters || ccnt < iters || rcnt < iters) { /* Wait till buffer changes. */ if (rcnt < iters && ! client_first_post) { ++rcnt; while (*poll_buf != (char)rcnt) { } /* Here the data is already in the physical memory. If we wanted to actually use it, we may need a read memory barrier here. 
*/ } else client_first_post = 0; if (scnt < iters) { struct ibv_send_wr *bad_wr; *post_buf = (char)++scnt; if (ibv_post_send(qp, wr, &bad_wr)) { fprintf(stderr, "Couldn't post send: scnt=%d\n", scnt); return 1; } } if (ccnt < iters) { struct ibv_wc wc; int ne; ++ccnt; do { ne = ibv_poll_cq(ctx->cq, 1, &wc); } while (ne == 0); if (ne < 0) { fprintf(stderr, "poll CQ failed %d\n", ne); return 1; } if (wc.status != IBV_WC_SUCCESS) { fprintf(stderr, "Completion with error at %s:\n", servername?"client":"server"); fprintf(stderr, "Failed status %d: wr_id %d\n", wc.status, (int) wc.wr_id); fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n", scnt, rcnt, ccnt); return 1; } } } if (gettimeofday(&end, NULL)) { perror("gettimeofday"); return 1; } { float usec = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_usec - start.tv_usec); long long bytes = (long long) size * iters; printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n", bytes, usec / 1000000., bytes * 8. / usec); printf("%d iters in %.2f seconds = %.2f usec/iter\n", iters, usec / 1000000., usec / iters); } return 0; } From mst at mellanox.co.il Mon Mar 14 22:23:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Mar 2005 08:23:52 +0200 Subject: [openib-general] uverbs security Message-ID: <20050315062352.GA19233@mellanox.co.il> Hi, Roland! Looking at uverbs kernel module, I notice that in some instances it passes some parameters from userspace directly to ib core, without verifying their sanity. One example of this is qp attributes in create and modify qp. For example, modify qp and alloc qp will simply copy the attributes. This might create issues since the core may assume it works against a trusted kernel client, so it may get confused if passed illegal parameter values. For example, qp type could be IB_QPT_SMI or IB_QPT_GSI. Will this create a problem? Hard for me to tell ... I think the best approach is to validate *all* user-given parameters before passing them on to core. What do you think?
-- MST - Michael S. Tsirkin From shaharf at voltaire.com Tue Mar 15 00:51:26 2005 From: shaharf at voltaire.com (shaharf) Date: Tue, 15 Mar 2005 10:51:26 +0200 Subject: [openib-general] openSM Message-ID: Hi, Does openSM support non-fat-tree (irregular, such as graph) topologies? Abhijeet ________________________________ [shaharf] Yes. OpenSM supports any type of mesh (connected graph). From halr at voltaire.com Tue Mar 15 07:17:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Mar 2005 10:17:17 -0500 Subject: [openib-general] user_mad.c and 2.6.11 Message-ID: <1110899837.4662.578.camel@localhost.localdomain> Hi Roland, Just ran across this reminder: Should user_mad.c be updated for the following: /* XXX remove once 2.6.11 is released */ Thanks. -- Hal From hozer at hozed.org Tue Mar 15 07:38:00 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 15 Mar 2005 09:38:00 -0600 Subject: [openib-general] uverbs security In-Reply-To: <20050315062352.GA19233@mellanox.co.il> References: <20050315062352.GA19233@mellanox.co.il> Message-ID: <20050315153759.GU9768@kalmia.hozed.org> On Tue, Mar 15, 2005 at 08:23:52AM +0200, Michael S. Tsirkin wrote: > Hi, Roland! > Looking at uverbs kernel module, I notice that in some instances > it passes some parameters from userspace directly to ib core, without > verifying their sanity. > > One example of this is qp attributes in create and modify qp. > > For example, modify qp and alloc qp will simply copy the attributes. > This might create issues since the core may assume it works against a > trusted kernel client, so it may get confused if passed illegal > parameter values. > > For example, qp type could be IB_QPT_SMI or IB_QPT_GSI. Will this create > a problem? Hard for me to tell ... > > I think the best approach is to validate *all* user-given parameters > before passing them on to core. What do you think? Yes.
We should be validating all user parameters, and be thinking about malicious userspace apps. This is another reason I think we ought to have the linux MM support a 'VM_REGISTERED' flag, and things like selinux can have different security policies for registered memory vs not-registered. I think we should probably also have (possibly compile-time) options for IB core to sanity check everything, regardless of whether it came from userspace or kernelspace. (Kind of like CONFIG_DEBUG_KERNEL and the like) From mst at mellanox.co.il Tue Mar 15 08:18:57 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Mar 2005 18:18:57 +0200 Subject: [openib-general] mstflint update Message-ID: <20050315161857.GD16749@mellanox.co.il> I have updated mstflint in the openib repository. revision 1990 fixes a crash and cleans up progress reporting in flash error recovery process. Tested on x86/ia64/i686. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Tue Mar 15 08:27:06 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Mar 2005 18:27:06 +0200 Subject: [openib-general] [PATCH] set lkey in mthca mpt entry Message-ID: <20050315162706.GG16749@mellanox.co.il> lkey does not seem to be set in the mpt entry. does this look right? Signed-off-by: Michael S. 
Tsirkin Index: hw/mthca/mthca_mr.c =================================================================== --- hw/mthca/mthca_mr.c (revision 1983) +++ hw/mthca/mthca_mr.c (working copy) @@ -206,9 +206,9 @@ int mthca_mr_alloc_notrans(struct mthca_ mpt_entry->pd = cpu_to_be32(pd); mpt_entry->start = 0; mpt_entry->length = ~0ULL; - - memset(&mpt_entry->lkey, 0, - sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->lkey = cpu_to_be32(mr->ibmr.lkey); + memset(&mpt_entry->window_count, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, window_count)); err = mthca_SW2HW_MPT(dev, mpt_entry, key & (dev->limits.num_mpts - 1), @@ -327,8 +327,9 @@ int mthca_mr_alloc_phys(struct mthca_dev mpt_entry->pd = cpu_to_be32(pd); mpt_entry->start = cpu_to_be64(iova); mpt_entry->length = cpu_to_be64(total_size); - memset(&mpt_entry->lkey, 0, - sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->lkey = cpu_to_be32(mr->ibmr.lkey); + memset(&mpt_entry->window_count, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, window_count)); mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + mr->first_seg * dev->limits.mtt_seg_size); -- MST - Michael S. Tsirkin From roland at topspin.com Tue Mar 15 08:41:58 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 08:41:58 -0800 Subject: [openib-general] [PATCH] set lkey in mthca mpt entry In-Reply-To: <20050315162706.GG16749@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 Mar 2005 18:27:06 +0200") References: <20050315162706.GG16749@mellanox.co.il> Message-ID: <52hdjczwrt.fsf@topspin.com> Michael> lkey does not seem to be set in the mpt entry. does this Michael> look right? You would know better but my docs say that the lkey field should be set to 0 for SW2HW_MPT and is only used to refer to the original region for memory windows. - R. 
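A note on the idiom in the patch above: both versions clear the tail of the MPT entry with one memset that starts at a named member, and they differ only in which member the memset starts from. A minimal userspace sketch follows, assuming a made-up struct demo_entry (this is not the real struct mthca_mpt_entry layout) and leaving out the cpu_to_be32() byte swapping; all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware table entry.  The member names echo
 * the patch, but the layout is invented for illustration only. */
struct demo_entry {
	uint32_t flags;
	uint32_t key;
	uint32_t pd;
	uint32_t lkey;
	uint32_t window_count;
	uint32_t reserved[3];
};

/* Zero every byte from byte offset `from` to the end of the entry. */
void demo_zero_tail(struct demo_entry *e, size_t from)
{
	assert(from <= sizeof *e);
	memset((unsigned char *) e + from, 0, sizeof *e - from);
}

/* Shape of the original code: lkey and everything after it are cleared. */
void demo_clear_from_lkey(struct demo_entry *e)
{
	demo_zero_tail(e, offsetof(struct demo_entry, lkey));
}

/* Shape of the proposed patch: lkey is written first, then only the
 * members after it are cleared. */
void demo_set_lkey(struct demo_entry *e, uint32_t lkey)
{
	e->lkey = lkey;
	demo_zero_tail(e, offsetof(struct demo_entry, window_count));
}
```

Starting the memset at lkey leaves that field at 0, which is what Roland's documentation asks for; starting it at window_count is what the proposed patch does, so the field written just before the memset survives.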
From roland at topspin.com Tue Mar 15 08:42:22 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 08:42:22 -0800 Subject: [openib-general] Re: user_mad.c and 2.6.11 In-Reply-To: <1110899837.4662.578.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Mar 2005 10:17:17 -0500") References: <1110899837.4662.578.camel@localhost.localdomain> Message-ID: <52d5u0zwr5.fsf@topspin.com> Hal> Hi Roland, Just ran across this reminder: Hal> Should user_mad.c be updated for the following: /* XXX remove Hal> once 2.6.11 is released */ Yep, I'd apply that patch for sure. - R. From tduffy at sun.com Tue Mar 15 09:16:24 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 15 Mar 2005 09:16:24 -0800 Subject: [openib-general] kernel 2.6.11 and userland packages? In-Reply-To: <20050312011527.GC9768@kalmia.hozed.org> References: <20050312011527.GC9768@kalmia.hozed.org> Message-ID: <1110906984.28053.19.camel@duffman> On Fri, 2005-03-11 at 19:15 -0600, Troy Benjegerdes wrote: > I have in my office a shiny new kernel.org 2.6.11 64 bit kernel running > on my Mac G5, with the drivers/infiniband modules loaded. > > What do I need to do to verify this all works? Do you have the IB card plugged into an IB switch? Is that switch running an SM? Do you have another machine connected to your G5? You can see if the card is initializing on your machine by running ibstatus. Or check out the /sys/class/infiniband/ directory manually. Check the FAQ. > Also, I'd really like to make debian packages of the userland utilities > and libraries, and get a debian/ subdirectory into the subversion > release, so the packages can be rebuilt easily. > > Where should I start on this? Write the .deb, send it as a file or patch to the list. -tduffy
From krause at cup.hp.com Tue Mar 15 09:51:07 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 15 Mar 2005 09:51:07 -0800 Subject: [openib-general] Getting rid of pinned memory requirement In-Reply-To: <8508251A6FC08A489844A94261D3693A03900B@fiona.siliquent.com> References: <8508251A6FC08A489844A94261D3693A03900B@fiona.siliquent.com> Message-ID: <6.2.0.14.2.20050315093749.02a06518@esmail.cup.hp.com> At 05:35 PM 3/14/2005, Caitlin Bestler wrote: > > > > -----Original Message----- > > From: Troy Benjegerdes [mailto:hozer at hozed.org] > > Sent: Monday, March 14, 2005 5:06 PM > > To: Caitlin Bestler > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] Getting rid of pinned memory requirement > > > > > > > > The key is that the entire operation either has to be fast > > > enough so that no connection or application session layer > > > time-outs occur, or an end-to-end agreement to suspend the > > > connection is a requirement. The first option seems more > > > plausible to me, the second essentially > > > requires extending the CM protocol. That's a tall order even for > > > InfiniBand, and it's even worse for iWARP where the CM > > > functionality typically ends when the connection is established. > > > > I'll buy the good network design argument. I and others designed InfiniBand RNR (Receiver not ready) operations to allow one to adjust V-to-P mappings (not change the address that was advertised) in order to allow an OS to safely play some games with memory and not drop a connection. The time values associated with RNR allow a solution to tolerate up to an infinite amount of time to perform such operations but the envisioned goal was to do this on the order of a handful of milliseconds in the worst case.
For iWARP, there was no support for defining RNR functionality as indeed many people claimed one could just drop in-bound segments and allow the retransmission protocol to deal with the delay (even if this has performance implications due to back-off algorithms though some claim SACK would minimize this to a large extent). Again, the idea was to minimize the worst case to milliseconds of down time. BTW, all of this assumed that the OS would not perform these types of changes that often so the long-term impact on an application would be minimal. > > > > I suppose if the kernel wants to revoke a card's pinned > > memory, we should be able to guarantee that it gets new > > pinned memory within a bounded time. What sort of timing do > > we need? Milliseconds? > > Microseconds? > > > > In the case of iWarp, isn't this just TCP underneath? If so, > > can't we just drop any packets in the pipe on the floor and > > let them get retransmitted? (I suppose the same argument goes > > for infiniband.. > > what sort of a time window do we have for retransmission?) > > > > What are the limits on end-to-end flow control in IB and iWarp? > > > > >From the RDMA Provider's perspective, the short answer is "quick enough > so that I don't have to do anything heroic to keep the connection alive." It should not require anything heroic. What it does require is a local method to suspend the local QP(s) so that it cannot place or read memory in the affected area. That can take some time depending upon the implementation. There is then the time to overwrite the mappings which again depending upon the implementation and the number of mappings could be milliseconds in length. >With TCP you also have to add "and healthy". If you've ever had a long >download that got effectively stalled by a burst of noise and you just hit >the 'reload' button on your browser then you know what I'm talking about.
> >But in transport neutral terms I would think that one RTT is definitely >safe -- that much data could have >been dropped by one switch failure or one nasty spike in inbound noise. > > > > > > > Yes, there are limits on how much memory you can mlock, or even > > > allocate. Applications are required to register memory precisely > > > because the required guarantees are not there by default. > > Eliminating > > > those guarantees *is* effectively rewriting every RDMA application > > > without even letting them know. > > > > Some of this argument is a policy issue, which I would argue > > shouldn't be hard-coded in the code or in the network hardware. > > > > At least in my view, the guarantees are only there to make > > applications go fast. We are getting low latency and high > > performance with infiniband by making memory registration go > > really really slow. If, to make big HPC simulation > > applications work, we wind up doing memcpy() to put the data > > into a registered buffer because we can't register half of > > physical memory, the application isn't going very fast. > > > >What you are looking for is a distinction between registering >memory to *enable* the RNIC to optimize local access and >registering memory to enable its being advertised to the >remote end. > >Early implementations of RDMA, both IB and iWARP, have not >distinguished between the two. But theoretically *applications* >do not need memory regions that are not enabled for remote >access to be pinned. That is an RNIC requirement that could >evolve. But applications themselves *do* need remotely >accessible memory regions, portions of which they intend >to advertise with RKeys, to be truly available (i.e., pinned). > >You are also making a policy assumption that an application >that actually needs half of physical memory should be using >paged memory. Memory is cheap, and if performance is critical >why should this memory be swapped out to disk?
> >Is the limitation on not being able to register half of >physical memory based upon some assumption that swapping >is a requirement? Or is it a limitation in the memory region >size? If it's the latter, you need to get the OS to support >larger page sizes. For some OS, you can pin very large areas. I've seen 15/16 of memory being able to be pinned with no adverse impacts on the applications. For these OS, kernel memory is effectively pinned memory. As such, depending upon the mix of services being provided, the system may operate quite nicely with such large amounts of memory being pinned. As more services are "ported" to operate over RDMA technologies, memory management isn't necessarily any harder; it just becomes something people have to think more about. Today's VM designs have allowed people to get sloppy as they assume that swapping will occur and since many platforms are not that loaded, they don't see any real adverse impacts. User-space RDMA applications require people to think once again about memory management and that swapping isn't a get-out-of-jail card. One needs to develop resource management tools to determine who obtains specified amounts of resources and their priorities. For the most part, this is somewhat a re-invention of some thinking that went into the micro-kernel work in past years. These problems are not intractable; they are only constrained by the legacy inertia inherent in all technologies today. Mike
From halr at voltaire.com Tue Mar 15 11:12:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Mar 2005 14:12:24 -0500 Subject: [openib-general] Re: [PATCH] [MAD] API changes and updates to support RMPP In-Reply-To: <20050314162258.1bedff07.mshefty@ichips.intel.com> References: <20050314162258.1bedff07.mshefty@ichips.intel.com> Message-ID: <1110913944.4662.666.camel@localhost.localdomain> On Mon, 2005-03-14 at 19:22, Sean Hefty wrote: > This patch updates the MAD API to help provide support for the RMPP > implementation and clients. Notable changes: Wouldn't this change also impact ib_user_mad.h and user_mad.c ? -- Hal From roland at topspin.com Tue Mar 15 11:27:11 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 11:27:11 -0800 Subject: [openib-general] uverbs security In-Reply-To: <20050315062352.GA19233@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 Mar 2005 08:23:52 +0200") References: <20050315062352.GA19233@mellanox.co.il> Message-ID: <52acp4yak0.fsf@topspin.com> Michael> Hi, Roland! Looking at uverbs kernel module, I notice Michael> that in some instances it passes some parameters from Michael> userspace directly to ib core, without verifying their Michael> sanity. Michael> One example of this is qp attributes in create and modify Michael> qp. Michael> For example, modify qp and alloc qp will simply copy the Michael> attributes. This might create issues since the core may Michael> assume it works against a trusted kernel client, so it Michael> may get confused if passed illegal parameter values. Michael> For example, qp type could be IB_QPT_SMI or Michael> IB_QPT_GSI. Will this create a problem? Hard for me to Michael> tell ... This particular example is OK, because mthca_provider.c has: case IB_QPT_SMI: case IB_QPT_GSI: { /* Don't allow userspace to create special QPs */ if (pd->uobject) return ERR_PTR(-EINVAL); but I agree it might be better to check this in the uverbs module.
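The kind of check discussed here can be sketched in plain C. This is only an illustration of rejecting special QP types at the user boundary, before a request reaches the core: the demo_* names, the request struct and the enum values are all hypothetical and are not the uverbs or ib core API.

```c
#include <errno.h>

/* Hypothetical QP types, loosely modelled on the IB_QPT_* names mentioned
 * in the thread; the numeric values are illustrative only. */
enum demo_qp_type {
	DEMO_QPT_SMI,
	DEMO_QPT_GSI,
	DEMO_QPT_RC,
	DEMO_QPT_UC,
	DEMO_QPT_UD
};

/* Hypothetical create-QP request as copied in from userspace. */
struct demo_create_qp_req {
	int qp_type;
};

/* Returns 0 if the request may be passed on, -EINVAL otherwise.
 * An allow-list is used so that unknown values are rejected too. */
int demo_validate_create_qp(const struct demo_create_qp_req *req)
{
	switch (req->qp_type) {
	case DEMO_QPT_RC:
	case DEMO_QPT_UC:
	case DEMO_QPT_UD:
		return 0;
	default:
		/* Special SMI/GSI types and any out-of-range value. */
		return -EINVAL;
	}
}
```

Listing the accepted types, instead of listing the forbidden ones, means a type added later is refused until someone decides it is safe for userspace.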
Michael> I think the best approach is to validate *all* user-given Michael> parameters before passing them on to core. What do you Michael> think? Yes, we should do as much validation as possible, although I'm not very worried about bad values that have no effect on anyone other than the userspace process itself. - R. From mshefty at ichips.intel.com Tue Mar 15 11:27:32 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 15 Mar 2005 11:27:32 -0800 Subject: [openib-general] Re: [PATCH] [MAD] API changes and updates to support RMPP In-Reply-To: <1110913944.4662.666.camel@localhost.localdomain> References: <20050314162258.1bedff07.mshefty@ichips.intel.com> <1110913944.4662.666.camel@localhost.localdomain> Message-ID: <42373724.9070507@ichips.intel.com> Hal Rosenstock wrote: > On Mon, 2005-03-14 at 19:22, Sean Hefty wrote: > >>This patch updates the MAD API to help provide support for the RMPP >>implementation and clients. Notable changes: > > > Wouldn't this change also impact ib_user_mad.h and user_mad.c ? I don't think that they affect those files directly. BUT, I didn't test against these two files, and I don't even think that I included them in my compile, which is an obvious oversight. It might be possible to remove the internal MR in user_mad.c, but that could still come in a separate patch. Something needs to be done to support RMPP in usermode, but I haven't thought that far ahead yet. - Sean From roland at topspin.com Tue Mar 15 12:27:58 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 12:27:58 -0800 Subject: [openib-general] [PATCH] InfiniBand: remove unsafe use of in_atomic() Message-ID: <52zmx4wt69.fsf@topspin.com> Using in_atomic() to decide between GFP_KERNEL and GFP_ATOMIC is not safe (it doesn't work if CONFIG_PREEMPT=n). Change to just always allocating with GFP_ATOMIC, since we don't know if we can sleep or not.
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-15 12:23:32.640868259 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-03-15 12:26:56.311553460 -0800 @@ -646,7 +646,7 @@ struct ib_smp *smp, struct ib_send_wr *send_wr) { - int ret, alloc_flags, solicited; + int ret, solicited; unsigned long flags; struct ib_mad_local_private *local; struct ib_mad_private *mad_priv; @@ -666,11 +666,7 @@ if (!ret || !device->process_mad) goto out; - if (in_atomic() || irqs_disabled()) - alloc_flags = GFP_ATOMIC; - else - alloc_flags = GFP_KERNEL; - local = kmalloc(sizeof *local, alloc_flags); + local = kmalloc(sizeof *local, GFP_ATOMIC); if (!local) { ret = -ENOMEM; printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); @@ -678,7 +674,7 @@ } local->mad_priv = NULL; local->recv_mad_agent = NULL; - mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_ATOMIC); if (!mad_priv) { ret = -ENOMEM; printk(KERN_ERR PFX "No memory for local response MAD\n"); @@ -860,9 +856,7 @@ } /* Allocate MAD send WR tracking structure */ - mad_send_wr = kmalloc(sizeof *mad_send_wr, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); + mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC); if (!mad_send_wr) { printk(KERN_ERR PFX "No memory for " "ib_mad_send_wr_private\n"); From Nitin.Hande at Sun.COM Tue Mar 15 13:15:57 2005 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Tue, 15 Mar 2005 13:15:57 -0800 Subject: [Fwd: Re: [openib-general] Solaris IPoIB MTU with OpenSM] In-Reply-To: <1109969601.4648.32.camel@erez-s.us.voltaire.com> References: <1109969601.4648.32.camel@erez-s.us.voltaire.com> Message-ID: <1110921356.7768.447.camel@sr1-umpk-01> Hal, On Fri, 2005-03-04 at 12:53, Hal Rosenstock wrote: > Hi again Nitin, > > Finally got a chance to work on this. I have a workaround for you for > now. Real patch later... Let me know if this does the trick for you. It > did for me. 
>
> -- Hal
>
> Index: osm_sa_mcmember_record.c
> ===================================================================
> --- osm_sa_mcmember_record.c (revision 1953)
> +++ osm_sa_mcmember_record.c (working copy)
> @@ -1522,9 +1522,11 @@
> if ((IB_MCR_COMPMASK_PROXY & comp_mask) &&
> (p_rcvd_rec->proxy_join != p_mgrp->mcmember_rec.proxy_join)) goto Exit;
>
> +#if 0
> /* if defined MUST match exactly !*/
> if ((IB_MCR_COMPMASK_MTU_SEL & comp_mask) &&
> ((p_rcvd_rec->mtu >> 6) != (p_mgrp->mcmember_rec.mtu >> 6))) goto Exit;
> +#endif
>
> if ((IB_MCR_COMPMASK_MTU & comp_mask) &&
> ((p_rcvd_rec->mtu & 0x3F) != (p_mgrp->mcmember_rec.mtu & 0x3F))) goto Exit;

This is cool, I have got Solaris IPoIB happily working with OpenSM
now. It plumbs, pings and snoops on the 0xffff pkey. Here is some output:

[root at dongon ~]# cat /etc/path_to_inst | grep ibd
"/pci at 8,600000/pci at 1/pci15b3,5a44 at 0/ibport at 1,ffff,ipib" 0 "ibd"
"/pci at 8,600000/pci at 1/pci15b3,5a44 at 0/ibport at 2,ffff,ipib" 1 "ibd"
[root at dongon ~]# ifconfig ibd0
ibd0: flags=1000843 mtu 2044 index 3
inet 192.168.100.111 netmask ffffff00 broadcast 192.168.100.255
ipib 0:0:0:16:fe:80:0:0:0:0:0:0:0:2:c9:1:9:76:51:d1
[root at dongon ~]# ping 192.168.100.112
192.168.100.112 is alive
[root at dongon ~]# snoop -d ibd1
192.168.100.112 -> * ARP C Who is 192.168.100.111, 192.168.100.111 ?
192.168.100.111 -> 192.168.100.112 ARP R 192.168.100.111, 192.168.100.111 is 0:0:0:16:fe:80:0:0:0:0:0:0:0:2:c9:1:9:76:51:d1
192.168.100.111 -> 192.168.100.112 ICMP Echo request (ID: 641 Sequence number: 0)
192.168.100.112 -> 192.168.100.111 ICMP Echo reply (ID: 641 Sequence number: 0)

This is fantastic. Thanks, Hal! BTW, I have not tested it with a
multi-packet GetTable response (RMPP).

On the other hand, on my Linux node, if I try to use the 0x8001 partition
and configure the IB interface with an IP address (while ib0 is using the
0xffff pkey at the same time), I get the following error; you may want to
investigate that:
[root at flopteron2 ~]# echo 0x8001 > /sys/class/net/ib0/create_child
[root at flopteron2 ~]# ifconfig ib0.8001 10.10.1.1
ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22
ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22
[the same message keeps repeating on the console, interleaved with the shell prompt]

Thanks
Nitin

>
>
> -----Forwarded Message-----
>
> From: Hal Rosenstock
> To: Nitin Hande
> Cc: openib , Tom Duffy
> Subject: Re: [openib-general] Solaris IPoIB MTU with OpenSM
> Date: 24 Feb 2005 08:42:23 -0500
>
> Hi Nitin,
>
> On Wed, 2005-02-23 at 17:19, Nitin Hande wrote:
> > Hal,
> >
> > [comments below]
> > On Wed, 2005-02-23 at 02:19, Hal Rosenstock wrote:
> > > On Tue, 2005-02-22 at 22:56, Nitin Hande wrote:
> > > > So I tried the latest patches and preliminarily things seem to be
> > > > working fine.
> > >
> > > Yipee.
> > [snip..]
> > >
> > > >
> > > > So after this test above, I try to run snoop on the solaris interface
> > > > and get the following error message from the layer below IPoIB:
> > > >
> > > > Feb 22 19:50:25 dongon.SFBay.Sun.COM ibd: [ID 517869 kern.info] NOTICE:
> > > > ibd0: HCA GUID 0002c901097651d0 port 1 PKEY ffff Could not get list of
> > > > IBA multicast groups
> > > >
> > > > My preliminary assumption is that OpenSm is not returning the list of
> > > > multicast groups that the ibd interface has joined. I will look at the
> > > > MAD's tomorrow and try to ascertain that.
> > >
> > > How does S10 request this ? Remember that if it is a GetTable and
> > > doesn't fit in a single MAD, it will be broken now. If that is the case,
> > > we will live with this until we have real RMPP.
> > Below is an example of a single GetTable request and response between
> > Solaris and OpenSM. OpenSM is not reporting the MCgroups in case of a
> > single request/response. I have also provided a MAD output between
> > Solaris IPoIB driver and IBSRM single GetTable request response below
> > this example.
> >
> > Here is the MAD trace between solaris and OpenSM:
> > Outgoing MAD:
> > BaseVersion: 0x1
> > MgmtClass: 0x3 - SubnAdm
> > ClassVersion: 0x2
> > R_Method: 0x12 - SubnAdmGetTable()
> > Status: 0x0 - NO_ERROR
> > ClassSpecific: 0x0
> > TransactionID: 0x97651d1000000ec
> > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
> >
> > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
> > 0: 01 03 02 12 00 00 00 00 09 76 51 d1 00 00 00 ec .........vQ.....
> > 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 .8..............
> > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................
> > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 50: 00 00 00 00 00 00 00 00 00 00 0b 1b 00 00 84 00 ................
> > 60: ff ff 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ........ .......
> > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > Incoming MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x92 - > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x97651d1000000ec > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 92 00 00 00 00 09 76 51 d1 00 00 00 ec .........vQ..... > > 10: 00 38 00 00 ff ff ff ff 01 01 77 00 00 00 00 01 .8........w..... > > 20: 00 00 00 14 00 00 00 00 00 00 00 00 00 07 00 00 ................ > > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > It is likely failing the component checking in > osm_sa_mcmember_record.c::__osm_sa_mcm_by_comp_mask_cb due to an endian > issue. Either you can debug this code or I will early next week. > > The component mask in the request is 0x80b4 so the only components > checked are QKey (0xb1b), MTU (exactly 2048 (4)), PKey (0xffff), and > scope (2). > > If I don't hear anything by next week, I will work on this then. > > Thanks. > > -- Hal > > > Here is the transaction between IBSRM and Solaris IPoIB driver. > > > > Outgoing MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x12 - SubnAdmGetTable() > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x8fecc610000009a > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 12 00 00 00 00 08 fe cc 61 00 00 00 9a ...........a.... > > 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 .8.............. > > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 81 23 45 68 00 00 84 00 .........#Eh.... > > 60: 80 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > Incoming MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x92 - > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x8fecc610000009a > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 92 00 00 00 00 08 fe cc 61 00 00 00 9a ...........a.... > > 10: 00 38 00 00 00 00 00 00 01 01 73 00 00 00 00 01 .8........s..... > > 20: 00 00 01 40 00 00 00 00 00 00 00 00 00 07 00 00 ... at ............ > > 30: 00 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00 .......... at ..... > > 40: 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 81 23 45 68 c0 04 84 00 .........#Eh.... > > 60: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > > 70: ff 12 40 1b 80 01 00 00 00 00 00 00 00 00 00 01 .. at ............. > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 81 23 45 68 c0 03 84 00 80 01 83 8d 00 00 00 00 .#Eh............ > > a0: 20 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00 ......... at ..... > > b0: 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 81 23 45 68 c0 00 84 00 .........#Eh.... > > d0: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > > e0: ff 12 60 1b 80 01 00 00 00 00 00 01 ff 76 5b 01 ..`..........v[. > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
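[Note on the traces above] Both GetTable requests carry the MTU byte 0x84 at offset 0x5e. That byte packs a 2-bit selector above a 6-bit MTU code, which is what the `>> 6` and `& 0x3F` tests in the osm_sa_mcmember_record.c workaround are picking apart. The following is a minimal C sketch of that decoding, not OpenSM code: the helper names are hypothetical, and the selector meanings and the code-to-bytes mapping (1 = 256 up to 5 = 4096) are assumed from the standard InfiniBand MTU encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not part of OpenSM): split an MCMemberRecord MTU byte. */

/* Top two bits: selector. Assumed meanings: 0 = greater than, 1 = less than,
 * 2 = exactly, 3 = largest available. */
static unsigned mtu_selector(uint8_t mtu_byte)
{
    return mtu_byte >> 6;
}

/* Low six bits: the encoded MTU value. */
static unsigned mtu_code(uint8_t mtu_byte)
{
    return mtu_byte & 0x3F;
}

/* Translate an MTU code to bytes; returns 0 for codes outside 1..5. */
static unsigned mtu_code_to_bytes(unsigned code)
{
    if (code < 1 || code > 5)
        return 0;
    return 128u << code;    /* 1 -> 256, 2 -> 512, 3 -> 1024, 4 -> 2048, 5 -> 4096 */
}

/* Usage example with the byte taken from the traces. */
static void mtu_decode_example(void)
{
    uint8_t mtu_byte = 0x84;

    assert(mtu_selector(mtu_byte) == 2);                       /* "exactly" */
    assert(mtu_code(mtu_byte) == 4);
    assert(mtu_code_to_bytes(mtu_code(mtu_byte)) == 2048);
}
```

With 0x84 this gives selector 2 and code 4, i.e. "MTU exactly 2048", which agrees with Hal's reading of the request.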
> > > > Thanks > > Nitin > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Tue Mar 15 13:23:51 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 13:23:51 -0800 Subject: [Fwd: Re: [openib-general] Solaris IPoIB MTU with OpenSM] In-Reply-To: <1110921356.7768.447.camel@sr1-umpk-01> (Nitin Hande's message of "Tue, 15 Mar 2005 13:15:57 -0800") References: <1109969601.4648.32.camel@erez-s.us.voltaire.com> <1110921356.7768.447.camel@sr1-umpk-01> Message-ID: <52vf7swql4.fsf@topspin.com> Nitin> On other hand, on my linux node, if I try to use 8001 Nitin> partition and configure IB interface with IP addr (same Nitin> time while ib0 is using 0xffff pkey), I get the following Nitin> error, you may want to investigate that.... I think this is probably an OpenSM issue (does OpenSM support multiple partitions?). On my fabric, running Topspin's embedded SM on a switch, I can do: # modprobe ib_ipoib # echo 0x8001 > /sys/class/net/ib0/create_child # ifconfig ib0.8001 up on both systems. 
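[Note on the group address] The address in the failed join reported earlier in this thread, ff12:401b:8001:0:0:0:ffff:ffff, has the shape of the IPoIB IPv4 broadcast group for P_Key 0x8001, and the same bytes (ff 12 40 1b 80 01 ... ff ff ff ff) appear in the IBSRM GetTable response. Below is a minimal C sketch of how such an MGID can be composed. It is not the kernel's IPoIB code: the function names are hypothetical, and it assumes the P_Key is copied unchanged into bytes 4-5 and that the scope is link-local (2).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the IPoIB IPv4 broadcast MGID for a P_Key. */
static void ipoib_bcast_mgid(uint16_t pkey, uint8_t mgid[16])
{
    memset(mgid, 0, 16);
    mgid[0] = 0xff;                   /* multicast prefix */
    mgid[1] = 0x12;                   /* flags = 1, scope = 2 (link-local) */
    mgid[2] = 0x40;                   /* IPv4-over-IB signature 0x401b */
    mgid[3] = 0x1b;
    mgid[4] = (uint8_t)(pkey >> 8);   /* P_Key, high byte first */
    mgid[5] = (uint8_t)(pkey & 0xff);
    memset(&mgid[12], 0xff, 4);       /* IPv4 broadcast 255.255.255.255 */
}

/* Format as eight 16-bit hex groups, the way the console message prints it. */
static void mgid_to_str(const uint8_t mgid[16], char *buf, size_t len)
{
    unsigned g[8];
    int i;

    for (i = 0; i < 8; i++)
        g[i] = ((unsigned)mgid[2 * i] << 8) | mgid[2 * i + 1];
    snprintf(buf, len, "%x:%x:%x:%x:%x:%x:%x:%x",
             g[0], g[1], g[2], g[3], g[4], g[5], g[6], g[7]);
}

/* Usage example for the partition discussed in this thread. */
static void mgid_example(void)
{
    uint8_t mgid[16];
    char text[64];

    ipoib_bcast_mgid(0x8001, mgid);
    mgid_to_str(mgid, text, sizeof text);
    assert(strcmp(text, "ff12:401b:8001:0:0:0:ffff:ffff") == 0);
}
```

The join for this group is what returns status -22, so whether the SM knows the 0x8001 partition and accepts a group on it is the thing to check.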
On system #1 I have: # ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 00-13-04-06-FE-80-00-00-00-00-00-00-00-00-00-00 inet6 addr: fe80::202:c901:7fc:c711/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:300 (300.0 b) and on system #2 I'm able to do: # ping6 -I ib0.8001 fe80::202:c901:7fc:c711 PING fe80::202:c901:7fc:c711(fe80::202:c901:7fc:c711) from fe80::202:c901:78c:e461 ib0.8001: 56 data bytes 64 bytes from fe80::202:c901:7fc:c711: icmp_seq=1 ttl=64 time=4.56 ms 64 bytes from fe80::202:c901:7fc:c711: icmp_seq=2 ttl=64 time=0.077 ms 64 bytes from fe80::202:c901:7fc:c711: icmp_seq=3 ttl=64 time=0.065 ms - R. From roland at topspin.com Tue Mar 15 14:06:29 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 14:06:29 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs Message-ID: <52k6o8wom2.fsf@topspin.com> I just spent a little time creating a new "ibv" module for NetPIPE that runs on top of the userspace verbs I've been developing on the roland-uverbs branch. This is pretty much a straight port of the current Mellanox VAPI "ib" module, with the main changes coming from the fact that OpenIB doesn't support the non-standard "unsignaled receive" extension, and the fact that a completion event thread is no longer created automatically. I found several bugs in the verbs support while making this work, but it seems quite stable now, although I haven't tried all option combinations. I also have not had a chance to compare Mellanox VAPI and OpenIB verbs performance on identical hardware -- it would be very useful to see this comparison on a variety of systems. The new ibv module is contained in the patch included below. 
Thanks, Roland --- NetPIPE_3.6.2.orig/makefile 2004-06-09 12:46:35.000000000 -0700 +++ NetPIPE_3.6.2/makefile 2005-03-15 13:58:08.000000000 -0800 @@ -229,6 +229,10 @@ -DINFINIBAND -DTCP -I $(VAPI_INC) -L $(VAPI_LIB) \ -lmpga -lvapi -lpthread +ibv: $(SRC)/ibv.c $(SRC)/netpipe.c $(SRC)/netpipe.h + $(CC) $(CFLAGS) $(SRC)/ibv.c $(SRC)/netpipe.c -o NPibv \ + -DOPENIB -DTCP -libverbs + atoll: $(SRC)/atoll.c $(SRC)/netpipe.c $(SRC)/netpipe.h $(CC) $(CFLAGS) -DATOLL $(SRC)/netpipe.c \ $(SRC)/atoll.c -o NPatoll \ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ NetPIPE_3.6.2/src/ibv.c 2005-03-15 13:30:03.000000000 -0800 @@ -0,0 +1,1072 @@ +/*****************************************************************************/ +/* "NetPIPE" -- Network Protocol Independent Performance Evaluator. */ +/* Copyright 1997, 1998 Iowa State University Research Foundation, Inc. */ +/* */ +/* This program is free software; you can redistribute it and/or modify */ +/* it under the terms of the GNU General Public License as published by */ +/* the Free Software Foundation. You should have received a copy of the */ +/* GNU General Public License along with this program; if not, write to the */ +/* Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ +/* */ +/* ibv.c ---- Infiniband module for OpenIB verbs */ +/*****************************************************************************/ + +#define USE_VOLATILE_RPTR /* needed for polling on last byte of recv buffer */ +#include "netpipe.h" +#include +#include +#include + +/* Debugging output macro */ + +FILE* logfile; + +#if 0 +#define LOGPRINTF(_format, _aa...) fprintf(logfile, "%s: " _format, __func__ , ##_aa); fflush(logfile) +#else +#define LOGPRINTF(_format, _aa...) 
+#endif + +/* Header files needed for Infiniband */ + +#include + +/* Global vars */ + +static struct ibv_device *hca; +static struct ibv_context *ctx; +static struct ibv_port_attr hca_port; +static int port_num; +static uint16_t lid; +static uint16_t d_lid; +static struct ibv_pd *pd_hndl; +static int num_cqe; +static int act_num_cqe; +static struct ibv_cq *s_cq_hndl; +static struct ibv_cq *r_cq_hndl; +static struct ibv_mr *s_mr_hndl; +static struct ibv_mr *r_mr_hndl; +static struct ibv_qp_init_attr qp_init_attr; +static struct ibv_qp *qp_hndl; +static uint32_t d_qp_num; +static struct ibv_qp_attr qp_attr; +static struct ibv_wc wc; +static int max_wq=50000; +static void* remote_address; +static uint32_t remote_key; +static volatile int receive_complete; +static pthread_t thread; + +/* Function definitions */ + +void Init(ArgStruct *p, int* pargc, char*** pargv) +{ + /* Set defaults + */ + p->prot.ib_mtu = IBV_MTU_1024; /* 1024 Byte MTU */ + p->prot.commtype = NP_COMM_RDMAWRITE; /* Use RDMA write communications */ + p->prot.comptype = NP_COMP_LOCALPOLL; /* Use local polling for completion */ + p->tr = 0; /* I am not the transmitter */ + p->rcv = 1; /* I am the receiver */ +} + +void Setup(ArgStruct *p) +{ + + int one = 1; + int sockfd; + struct sockaddr_in *lsin1, *lsin2; /* ptr to sockaddr_in in ArgStruct */ + char *host; + struct hostent *addr; + struct protoent *proto; + int send_size, recv_size, sizeofint = sizeof(int); + struct sigaction sigact1; + char logfilename[80]; + + /* Sanity check */ + if( p->prot.commtype == NP_COMM_RDMAWRITE && + p->prot.comptype != NP_COMP_LOCALPOLL ) { + fprintf(stderr, "Error, RDMA Write may only be used with local polling.\n"); + fprintf(stderr, "Try using RDMA Write With Immediate Data with vapi polling\n"); + fprintf(stderr, "or event completion\n"); + exit(-1); + } + + if( p->prot.commtype != NP_COMM_RDMAWRITE && + p->prot.comptype == NP_COMP_LOCALPOLL ) { + fprintf(stderr, "Error, local polling may only be used with RDMA 
Write.\n"); + fprintf(stderr, "Try using vapi polling or event completion\n"); + exit(-1); + } + + /* Open log file */ + sprintf(logfilename, ".iblog%d", 1 - p->tr); + logfile = fopen(logfilename, "w"); + + host = p->host; /* copy ptr to hostname */ + + lsin1 = &(p->prot.sin1); + lsin2 = &(p->prot.sin2); + + bzero((char *) lsin1, sizeof(*lsin1)); + bzero((char *) lsin2, sizeof(*lsin2)); + + if ( (sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0){ + printf("NetPIPE: can't open stream socket! errno=%d\n", errno); + exit(-4); + } + + if(!(proto = getprotobyname("tcp"))){ + printf("NetPIPE: protocol 'tcp' unknown!\n"); + exit(555); + } + + if (p->tr){ /* if client i.e., Sender */ + + + if (atoi(host) > 0) { /* Numerical IP address */ + lsin1->sin_family = AF_INET; + lsin1->sin_addr.s_addr = inet_addr(host); + + } else { + + if ((addr = gethostbyname(host)) == NULL){ + printf("NetPIPE: invalid hostname '%s'\n", host); + exit(-5); + } + + lsin1->sin_family = addr->h_addrtype; + bcopy(addr->h_addr, (char*) &(lsin1->sin_addr.s_addr), addr->h_length); + } + + lsin1->sin_port = htons(p->port); + + } else { /* we are the receiver (server) */ + + bzero((char *) lsin1, sizeof(*lsin1)); + lsin1->sin_family = AF_INET; + lsin1->sin_addr.s_addr = htonl(INADDR_ANY); + lsin1->sin_port = htons(p->port); + + if (bind(sockfd, (struct sockaddr *) lsin1, sizeof(*lsin1)) < 0){ + printf("NetPIPE: server: bind on local address failed! 
errno=%d", errno); + exit(-6); + } + + } + + if(p->tr) + p->commfd = sockfd; + else + p->servicefd = sockfd; + + + + /* Establish tcp connections */ + + establish(p); + + /* Initialize Mellanox Infiniband */ + + if(initIB(p) == -1) { + CleanUp(p); + exit(-1); + } +} + +void event_handler(struct ibv_cq *cq); + +void *EventThread(void *unused) +{ + struct ibv_cq *cq; + void *data; + + while (1) { + if (ibv_get_cq_event(ctx, 0, &cq, &data)) { + fprintf(stderr, "Failed to get CQ event\n"); + return NULL; + } + event_handler(cq); + } +} + +int initIB(ArgStruct *p) +{ + struct dlist *dev_list; + int ret; + + dev_list = ibv_get_devices(); + dlist_start(dev_list); + hca = dlist_next(dev_list); + if (!hca) { + fprintf(stderr, "Couldn't find any InfiniBand devices\n"); + return -1; + } else { + LOGPRINTF("Found Infiniband HCA %s\n", ibv_get_device_name(hca)); + } + + ctx = ibv_open_device(hca); + if (!ctx) { + fprintf(stderr, "Couldn't create InfiniBand context\n"); + return -1; + } else { + LOGPRINTF("Found Infiniband HCA %s\n", ibv_get_device_name(hca)); + } + + /* Get HCA properties */ + + port_num=1; + ret = ibv_query_port(ctx, port_num, &hca_port); + if(ret) { + fprintf(stderr, "Error querying Infiniband HCA\n"); + return -1; + } else { + LOGPRINTF("Queried Infiniband HCA\n"); + } + lid = hca_port.lid; + LOGPRINTF(" lid = %d\n", lid); + + + /* Allocate Protection Domain */ + + pd_hndl = ibv_alloc_pd(ctx); + if(!pd_hndl) { + fprintf(stderr, "Error allocating PD\n"); + return -1; + } else { + LOGPRINTF("Allocated Protection Domain\n"); + } + + + /* Create send completion queue */ + + num_cqe = 30000; /* Requested number of completion q elements */ + s_cq_hndl = ibv_create_cq(ctx, num_cqe, NULL); + if(!s_cq_hndl) { + fprintf(stderr, "Error creating send CQ\n"); + return -1; + } else { + act_num_cqe = s_cq_hndl->cqe; + LOGPRINTF("Created Send Completion Queue with %d elements\n", act_num_cqe); + } + + + /* Create recv completion queue */ + + num_cqe = 20000; /* Requested 
number of completion q elements */ + r_cq_hndl = ibv_create_cq(ctx, num_cqe, NULL); + if(!r_cq_hndl) { + fprintf(stderr, "Error creating recv CQ\n"); + return -1; + } else { + act_num_cqe = r_cq_hndl->cqe; + LOGPRINTF("Created Recv Completion Queue with %d elements\n", act_num_cqe); + } + + + /* Placeholder for MR */ + + + /* Create Queue Pair */ + + qp_init_attr.cap.max_recv_wr = max_wq; /* Max outstanding WR on RQ */ + qp_init_attr.cap.max_send_wr = max_wq; /* Max outstanding WR on SQ */ + qp_init_attr.cap.max_recv_sge = 1; /* Max scatter/gather entries on RQ */ + qp_init_attr.cap.max_send_sge = 1; /* Max scatter/gather entries on SQ */ + qp_init_attr.recv_cq = r_cq_hndl; /* CQ handle for RQ */ + qp_init_attr.send_cq = s_cq_hndl; /* CQ handle for SQ */ + qp_init_attr.sq_sig_all = 0; /* Signalling type */ + qp_init_attr.qp_type = IBV_QPT_RC; /* Transmission type */ + + qp_hndl = ibv_create_qp(pd_hndl, &qp_init_attr); + if(!qp_hndl) { + fprintf(stderr, "Error creating Queue Pair\n"); + return -1; + } else { + LOGPRINTF("Created Queue Pair\n"); + } + + + /* Exchange lid and qp_num with other node */ + + if( write(p->commfd, &lid, sizeof(lid) ) != sizeof(lid) ) { + fprintf(stderr, "Failed to send lid over socket\n"); + return -1; + } + if( write(p->commfd, &qp_hndl->qp_num, sizeof(qp_hndl->qp_num) ) != sizeof(qp_hndl->qp_num) ) { + fprintf(stderr, "Failed to send qpnum over socket\n"); + return -1; + } + if( read(p->commfd, &d_lid, sizeof(d_lid) ) != sizeof(d_lid) ) { + fprintf(stderr, "Failed to read lid from socket\n"); + return -1; + } + if( read(p->commfd, &d_qp_num, sizeof(d_qp_num) ) != sizeof(d_qp_num) ) { + fprintf(stderr, "Failed to read qpnum from socket\n"); + return -1; + } + + LOGPRINTF("Local: lid=%d qp_num=%d Remote: lid=%d qp_num=%d\n", + lid, qp_hndl->qp_num, d_lid, d_qp_num); + + + /* Bring up Queue Pair */ + + /******* INIT state ******/ + + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.pkey_index = 0; + qp_attr.port_num = port_num; +
qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS); + if(ret) { + fprintf(stderr, "Error modifying QP to INIT\n"); + return -1; + } + + LOGPRINTF("Modified QP to INIT\n"); + + /******* RTR (Ready-To-Receive) state *******/ + + qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.max_dest_rd_atomic = 1; + qp_attr.dest_qp_num = d_qp_num; + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.is_global = 0; + qp_attr.ah_attr.dlid = d_lid; + qp_attr.ah_attr.static_rate = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = port_num; + qp_attr.path_mtu = p->prot.ib_mtu; + qp_attr.rq_psn = 0; + qp_attr.pkey_index = 0; + qp_attr.min_rnr_timer = 5; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER); + + if(ret) { + fprintf(stderr, "Error modifying QP to RTR\n"); + return -1; + } + + LOGPRINTF("Modified QP to RTR\n"); + + /* Sync before going to RTS state */ + Sync(p); + + /******* RTS (Ready-to-Send) state *******/ + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + qp_attr.timeout = 31; + qp_attr.retry_cnt = 1; + qp_attr.rnr_retry = 1; + qp_attr.max_rd_atomic = 1; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC); + + if(ret) { + fprintf(stderr, "Error modifying QP to RTS\n"); + return -1; + } + + LOGPRINTF("Modified QP to RTS\n"); + + /* If using event completion, request the initial notification */ + if( p->prot.comptype == NP_COMP_EVENT ) { + if (pthread_create(&thread, NULL, EventThread, NULL)) { + fprintf(stderr, "Couldn't start event thread\n"); + return -1; + } + ibv_req_notify_cq(r_cq_hndl, 0); + } + + return 0; +} + +int finalizeIB(ArgStruct *p) +{ + int ret; + + 
LOGPRINTF("Finalizing IB stuff\n"); + + if(qp_hndl) { + LOGPRINTF("Destroying QP\n"); + ret = ibv_destroy_qp(qp_hndl); + if(ret) { + fprintf(stderr, "Error destroying Queue Pair\n"); + } + } + + if(r_cq_hndl) { + LOGPRINTF("Destroying Recv CQ\n"); + ret = ibv_destroy_cq(r_cq_hndl); + if(ret) { + fprintf(stderr, "Error destroying recv CQ\n"); + } + } + + if(s_cq_hndl) { + LOGPRINTF("Destroying Send CQ\n"); + ret = ibv_destroy_cq(s_cq_hndl); + if(ret) { + fprintf(stderr, "Error destroying send CQ\n"); + } + } + + /* Check memory registrations just in case user bailed out */ + if(s_mr_hndl) { + LOGPRINTF("Deregistering send buffer\n"); + ret = ibv_dereg_mr(s_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering send mr\n"); + } + } + + if(r_mr_hndl) { + LOGPRINTF("Deregistering recv buffer\n"); + ret = ibv_dereg_mr(r_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering recv mr\n"); + } + } + + if(pd_hndl) { + LOGPRINTF("Deallocating PD\n"); + ret = ibv_dealloc_pd(pd_hndl); + if(ret) { + fprintf(stderr, "Error deallocating PD\n"); + } + } + + /* Application code should not close HCA, just release handle */ + + if(ctx) { + LOGPRINTF("Releasing HCA\n"); + ret = ibv_close_device(ctx); + if(ret) { + fprintf(stderr, "Error releasing HCA\n"); + } + } + + return 0; +} + +void event_handler(struct ibv_cq *cq) +{ + int ret; + + while(1) { + + ret = ibv_poll_cq(cq, 1, &wc); + + if(ret == 0) { + LOGPRINTF("Empty completion queue, requesting next notification\n"); + ibv_req_notify_cq(r_cq_hndl, 0); + return; + } else if(ret < 0) { + fprintf(stderr, "Error in event_handler, polling cq\n"); + exit(-1); + } else if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Error in event_handler, on returned work completion " + "status: %d\n", wc.status); + exit(-1); + } + + LOGPRINTF("Retrieved work completion\n"); + + /* For ping-pong mode at least, this check shouldn't be needed for + * normal operation, but it will help catch any bugs with multiple + * sends coming through 
when we're only expecting one. + */ + if(receive_complete == 1) { + + while(receive_complete != 0) sched_yield(); + + } + + receive_complete = 1; + + } + +} + +static int +readFully(int fd, void *obuf, int len) +{ + int bytesLeft = len; + char *buf = (char *) obuf; + int bytesRead = 0; + + while (bytesLeft > 0 && + (bytesRead = read(fd, (void *) buf, bytesLeft)) > 0) + { + bytesLeft -= bytesRead; + buf += bytesRead; + } + if (bytesRead <= 0) + return bytesRead; + return len; +} + +void Sync(ArgStruct *p) +{ + char s[] = "SyncMe"; + char response[7]; + + if (write(p->commfd, s, strlen(s)) < 0 || + readFully(p->commfd, response, strlen(s)) < 0) + { + perror("NetPIPE: error writing or reading synchronization string"); + exit(3); + } + if (strncmp(s, response, strlen(s))) + { + fprintf(stderr, "NetPIPE: Synchronization string incorrect!\n"); + exit(3); + } +} + +void PrepareToReceive(ArgStruct *p) +{ + int ret; /* Return code */ + struct ibv_recv_wr rr; /* Receive request */ + struct ibv_recv_wr *bad_wr; + struct ibv_sge sg_entry; /* Scatter/Gather list - holds buff addr */ + + /* We don't need to post a receive if doing RDMA write with local polling */ + + if( p->prot.commtype == NP_COMM_RDMAWRITE && + p->prot.comptype == NP_COMP_LOCALPOLL ) + return; + + rr.num_sge = 1; + rr.sg_list = &sg_entry; + rr.next = NULL; + + sg_entry.lkey = r_mr_hndl->lkey; + sg_entry.length = p->bufflen; + sg_entry.addr = (uintptr_t)p->r_ptr; + + ret = ibv_post_recv(qp_hndl, &rr, &bad_wr); + if(ret) { + fprintf(stderr, "Error posting recv request\n"); + CleanUp(p); + exit(-1); + } else { + LOGPRINTF("Posted recv request\n"); + } + + /* Set receive flag to zero and request event completion + * notification for this receive so the event handler will + * be triggered when the receive completes. 
+ */ + if( p->prot.comptype == NP_COMP_EVENT ) { + receive_complete = 0; + } +} + +void SendData(ArgStruct *p) +{ + int ret; /* Return code */ + struct ibv_send_wr sr; /* Send request */ + struct ibv_send_wr *bad_wr; + struct ibv_sge sg_entry; /* Scatter/Gather list - holds buff addr */ + + /* Fill in send request struct */ + + if(p->prot.commtype == NP_COMM_SENDRECV) { + sr.opcode = IBV_WR_SEND; + LOGPRINTF("Doing regular send\n"); + } else if(p->prot.commtype == NP_COMM_SENDRECV_WITH_IMM) { + sr.opcode = IBV_WR_SEND_WITH_IMM; + LOGPRINTF("Doing regular send with imm\n"); + } else if(p->prot.commtype == NP_COMM_RDMAWRITE) { + sr.opcode = IBV_WR_RDMA_WRITE; + sr.wr.rdma.remote_addr = (uintptr_t)(remote_address + (p->s_ptr - p->s_buff)); + sr.wr.rdma.rkey = remote_key; + LOGPRINTF("Doing RDMA write (raddr=%p)\n", sr.wr.rdma.remote_addr); + } else if(p->prot.commtype == NP_COMM_RDMAWRITE_WITH_IMM) { + sr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM; + sr.wr.rdma.remote_addr = (uintptr_t)(remote_address + (p->s_ptr - p->s_buff)); + sr.wr.rdma.rkey = remote_key; + LOGPRINTF("Doing RDMA write with imm (raddr=%p)\n", sr.wr.rdma.remote_addr); + } else { + fprintf(stderr, "Error, invalid communication type in SendData\n"); + exit(-1); + } + + sr.send_flags = 0; /* This needed due to a bug in Mellanox HW rel a-0 */ + + sr.num_sge = 1; + sr.sg_list = &sg_entry; + sr.next = NULL; + + sg_entry.lkey = s_mr_hndl->lkey; /* Local memory region key */ + sg_entry.length = p->bufflen; + sg_entry.addr = (uintptr_t)p->s_ptr; + + ret = ibv_post_send(qp_hndl, &sr, &bad_wr); + if(ret) { + fprintf(stderr, "Error posting send request\n"); + } else { + LOGPRINTF("Posted send request\n"); + } + +} + +void RecvData(ArgStruct *p) +{ + int ret; + + /* Busy wait for incoming data */ + + LOGPRINTF("Receiving at buffer address %p\n", p->r_ptr); + + /* + * Unsignaled receives are not supported, so we must always poll the + * CQ, except when using RDMA writes. 
+ */ + if( p->prot.commtype == NP_COMM_RDMAWRITE ) { + + /* Poll for receive completion locally on the receive data */ + + LOGPRINTF("Waiting for last byte of data to arrive\n"); + + while(p->r_ptr[p->bufflen-1] != 'a' + (p->cache ? 1 - p->tr : 1) ) + { + /* BUSY WAIT -- this should be fine since we + * declared r_ptr with volatile qualifier */ + } + + /* Reset last byte */ + p->r_ptr[p->bufflen-1] = 'a' + (p->cache ? p->tr : 0); + + LOGPRINTF("Received all of data\n"); + + } else if( p->prot.comptype != NP_COMP_EVENT ) { + + /* Poll for receive completion using VAPI poll function */ + + LOGPRINTF("Polling completion queue for VAPI work completion\n"); + + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(r_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error in RecvData, polling for completion\n"); + exit(-1); + } + + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Error in status of returned completion: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Retrieved successful completion\n"); + + } else if( p->prot.comptype == NP_COMP_EVENT ) { + + /* Instead of polling directly on data or VAPI completion queue, + * let the VAPI event completion handler set a flag when the receive + * completes, and poll on that instead. Could try using semaphore here + * as well to eliminate busy polling + */ + + LOGPRINTF("Polling receive flag\n"); + + while( receive_complete == 0 ) + { + /* BUSY WAIT */ + } + + /* If in prepost-burst mode, we won't be calling PrepareToReceive + * between ping-pongs, so we need to reset the receive_complete + * flag here. 
+ */ + if( p->preburst ) receive_complete = 0; + + LOGPRINTF("Receive completed\n"); + } +} + +/* Reset is used after a trial to empty the work request queues so we + have enough room for the next trial to run */ +void Reset(ArgStruct *p) +{ + + int ret; /* Return code */ + struct ibv_send_wr sr; /* Send request */ + struct ibv_send_wr *bad_sr; + struct ibv_recv_wr rr; /* Recv request */ + struct ibv_recv_wr *bad_rr; + + /* If comptype is event, then we'll use event handler to detect receive, + * so initialize receive_complete flag + */ + if(p->prot.comptype == NP_COMP_EVENT) receive_complete = 0; + + /* Prepost receive */ + rr.num_sge = 0; + rr.next = NULL; + + LOGPRINTF("Posting recv request in Reset\n"); + ret = ibv_post_recv(qp_hndl, &rr, &bad_rr); + if(ret) { + fprintf(stderr, " Error posting recv request\n"); + CleanUp(p); + exit(-1); + } + + /* Make sure both nodes have preposted receives */ + Sync(p); + + /* Post Send */ + sr.opcode = IBV_WR_SEND; + sr.send_flags = IBV_SEND_SIGNALED; + sr.num_sge = 0; + sr.next = NULL; + + LOGPRINTF("Posting send request \n"); + ret = ibv_post_send(qp_hndl, &sr, &bad_sr); + if(ret) { + fprintf(stderr, " Error posting send request in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Polling for completion of send request\n"); + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(s_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error polling CQ for send in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Status of send completion: %d\n", wc.status); + + if(p->prot.comptype == NP_COMP_EVENT) { + /* If using event completion, the event handler will set receive_complete + * when it gets the completion event. 
+ */ + LOGPRINTF("Waiting for receive_complete flag\n"); + while(receive_complete == 0) { /* BUSY WAIT */ } + } else { + LOGPRINTF("Polling for completion of recv request\n"); + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(r_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error polling CQ for recv in Reset"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Status of recv completion: %d\n", wc.status); + } + LOGPRINTF("Done with reset\n"); +} + +void SendTime(ArgStruct *p, double *t) +{ + uint32_t ltime, ntime; + + /* + Multiply the number of seconds by 1e6 to get time in microseconds + and convert value to an unsigned 32-bit integer. + */ + ltime = (uint32_t)(*t * 1.e6); + + /* Send time in network order */ + ntime = htonl(ltime); + if (write(p->commfd, (char *)&ntime, sizeof(uint32_t)) < 0) + { + printf("NetPIPE: write failed in SendTime: errno=%d\n", errno); + exit(301); + } +} + +void RecvTime(ArgStruct *p, double *t) +{ + uint32_t ltime, ntime; + int bytesRead; + + bytesRead = readFully(p->commfd, (void *)&ntime, sizeof(uint32_t)); + if (bytesRead < 0) + { + printf("NetPIPE: read failed in RecvTime: errno=%d\n", errno); + exit(302); + } + else if (bytesRead != sizeof(uint32_t)) + { + fprintf(stderr, "NetPIPE: partial read in RecvTime of %d bytes\n", + bytesRead); + exit(303); + } + ltime = ntohl(ntime); + + /* Result is ltime (in microseconds) divided by 1.0e6 to get seconds */ + *t = (double)ltime / 1.0e6; +} + +void SendRepeat(ArgStruct *p, int rpt) +{ + uint32_t lrpt, nrpt; + + lrpt = rpt; + /* Send repeat count as a long in network order */ + nrpt = htonl(lrpt); + if (write(p->commfd, (void *) &nrpt, sizeof(uint32_t)) < 0) + { + printf("NetPIPE: write failed in SendRepeat: errno=%d\n", errno); + exit(304); + } +} + +void RecvRepeat(ArgStruct *p, int *rpt) +{ + uint32_t lrpt, nrpt; + int bytesRead; + + bytesRead = readFully(p->commfd, (void 
*)&nrpt, sizeof(uint32_t)); + if (bytesRead < 0) + { + printf("NetPIPE: read failed in RecvRepeat: errno=%d\n", errno); + exit(305); + } + else if (bytesRead != sizeof(uint32_t)) + { + fprintf(stderr, "NetPIPE: partial read in RecvRepeat of %d bytes\n", + bytesRead); + exit(306); + } + lrpt = ntohl(nrpt); + + *rpt = lrpt; +} + +void establish(ArgStruct *p) +{ + int clen; + int one = 1; + struct protoent; + + clen = sizeof(p->prot.sin2); + if(p->tr){ + if(connect(p->commfd, (struct sockaddr *) &(p->prot.sin1), + sizeof(p->prot.sin1)) < 0){ + printf("Client: Cannot Connect! errno=%d\n",errno); + exit(-10); + } + } + else { + /* SERVER */ + listen(p->servicefd, 5); + p->commfd = accept(p->servicefd, (struct sockaddr *) &(p->prot.sin2), + &clen); + + if(p->commfd < 0){ + printf("Server: Accept Failed! errno=%d\n",errno); + exit(-12); + } + } +} + +void CleanUp(ArgStruct *p) +{ + char *quit="QUIT"; + if (p->tr) + { + write(p->commfd,quit, 5); + read(p->commfd, quit, 5); + close(p->commfd); + } + else + { + read(p->commfd,quit, 5); + write(p->commfd,quit,5); + close(p->commfd); + close(p->servicefd); + } + + finalizeIB(p); +} + + +void AfterAlignmentInit(ArgStruct *p) +{ + int bytesRead; + + /* Exchange buffer pointers and remote infiniband keys if doing rdma. Do + * the exchange in this function because this will happen after any + * memory alignment is done, which is important for getting the + * correct remote address. 
+ */ + if( p->prot.commtype == NP_COMM_RDMAWRITE || + p->prot.commtype == NP_COMM_RDMAWRITE_WITH_IMM ) { + + /* Send my receive buffer address + */ + if(write(p->commfd, (void *)&p->r_buff, sizeof(void*)) < 0) { + perror("NetPIPE: write of buffer address failed in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Sent buffer address: %p\n", p->r_buff); + + /* Send my remote key for accessing + * my remote buffer via IB RDMA + */ + if(write(p->commfd, (void *)&r_mr_hndl->rkey, sizeof(uint32_t)) < 0) { + perror("NetPIPE: write of remote key failed in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Sent remote key: %d\n", r_mr_hndl->rkey); + + /* Read the sent data + */ + bytesRead = readFully(p->commfd, (void *)&remote_address, sizeof(void*)); + if (bytesRead < 0) { + perror("NetPIPE: read of buffer address failed in AfterAlignmentInit"); + exit(-1); + } else if (bytesRead != sizeof(void*)) { + perror("NetPIPE: partial read of buffer address in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Received remote address from other node: %p\n", remote_address); + + bytesRead = readFully(p->commfd, (void *)&remote_key, sizeof(uint32_t)); + if (bytesRead < 0) { + perror("NetPIPE: read of remote key failed in AfterAlignmentInit"); + exit(-1); + } else if (bytesRead != sizeof(uint32_t)) { + perror("NetPIPE: partial read of remote key in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Received remote key from other node: %d\n", remote_key); + + } +} + + +void MyMalloc(ArgStruct *p, int bufflen, int soffset, int roffset) +{ + /* Allocate buffers */ + + p->r_buff = malloc(bufflen+MAX(soffset,roffset)); + if(p->r_buff == NULL) { + fprintf(stderr, "Error malloc'ing buffer\n"); + exit(-1); + } + + if(p->cache) { + + /* Infiniband spec says we can register same memory region + * more than once, so just copy buffer address. We will register + * the same buffer twice with Infiniband. 
+ */ + p->s_buff = p->r_buff; + + } else { + + p->s_buff = malloc(bufflen+soffset); + if(p->s_buff == NULL) { + fprintf(stderr, "Error malloc'ing buffer\n"); + exit(-1); + } + + } + + /* Register buffers with Infiniband */ + + r_mr_hndl = ibv_reg_mr(pd_hndl, p->r_buff, bufflen + MAX(soffset, roffset), + IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE); + if(!r_mr_hndl) + { + fprintf(stderr, "Error registering recv buffer\n"); + exit(-1); + } + else + { + LOGPRINTF("Registered Recv Buffer\n"); + } + + s_mr_hndl = ibv_reg_mr(pd_hndl, p->s_buff, bufflen+soffset, IBV_ACCESS_LOCAL_WRITE); + if(!s_mr_hndl) { + fprintf(stderr, "Error registering send buffer\n"); + exit(-1); + } else { + LOGPRINTF("Registered Send Buffer\n"); + } + +} +void FreeBuff(char *buff1, char *buff2) +{ + int ret; + + if(s_mr_hndl) { + LOGPRINTF("Deregistering send buffer\n"); + ret = ibv_dereg_mr(s_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering send mr\n"); + } else { + s_mr_hndl = NULL; + } + } + + if(r_mr_hndl) { + LOGPRINTF("Deregistering recv buffer\n"); + ret = ibv_dereg_mr(r_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering recv mr\n"); + } else { + r_mr_hndl = NULL; + } + } + + if(buff1 != NULL) + free(buff1); + + if(buff2 != NULL) + free(buff2); +} + --- NetPIPE_3.6.2.orig/src/netpipe.c 2004-06-22 12:38:41.000000000 -0700 +++ NetPIPE_3.6.2/src/netpipe.c 2005-03-15 12:36:44.000000000 -0800 @@ -142,7 +142,7 @@ case 's': streamopt = 1; printf("Streaming in one direction only.\n\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("Sockets are reset between trials to avoid\n"); printf("degradation from a collapsing window size.\n\n"); #endif @@ -168,7 +168,7 @@ case 'u': end = atoi(optarg); break; -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! 
defined(INFINIBAND) && !defined(OPENIB) case 'b': /* -b # resets the buffer size, -b 0 keeps system defs */ args.prot.sndbufsz = args.prot.rcvbufsz = atoi(optarg); break; @@ -178,7 +178,7 @@ /* end will be maxed at sndbufsz+rcvbufsz */ printf("Passing data in both directions simultaneously.\n"); printf("Output is for the combined bandwidth.\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("The socket buffer size limits the maximum test size.\n\n"); #endif if( streamopt ) { @@ -270,7 +270,29 @@ exit(-1); } break; +#endif + +#if defined(OPENIB) + case 'm': switch(atoi(optarg)) { + case 256: args.prot.ib_mtu = IBV_MTU_256; + break; + case 512: args.prot.ib_mtu = IBV_MTU_512; + break; + case 1024: args.prot.ib_mtu = IBV_MTU_1024; + break; + case 2048: args.prot.ib_mtu = IBV_MTU_2048; + break; + case 4096: args.prot.ib_mtu = IBV_MTU_4096; + break; + default: + fprintf(stderr, "Invalid MTU size, must be one of " + "256, 512, 1024, 2048, 4096\n"); + exit(-1); + } + break; +#endif +#if defined(OPENIB) || defined(INFINIBAND) case 't': if( !strcmp(optarg, "send_recv") ) { printf("Using Send/Receive communications\n"); args.prot.commtype = NP_COMM_SENDRECV; @@ -317,7 +339,7 @@ case 'n': nrepeat_const = atoi(optarg); break; -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) case 'r': args.reset_conn = 1; printf("Resetting connection after every trial\n"); break; @@ -331,7 +353,7 @@ #endif /* ! defined TCGMSG */ -#if defined(INFINIBAND) +#if defined(OPENIB) || defined(INFINIBAND) asyncReceive = 1; fprintf(stderr, "Preposting asynchronous receives (required for Infiniband)\n"); if(args.bidir && ( @@ -377,7 +399,7 @@ end = args.upper; if( args.tr ) { printf("The upper limit is being set to %d Bytes\n", end); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! 
defined(INFINIBAND) && !defined(OPENIB) printf("due to socket buffer size limitations\n\n"); #endif } } @@ -990,7 +1012,7 @@ void PrintUsage() { printf("\n NETPIPE USAGE \n\n"); -#if ! defined(INFINIBAND) +#if ! defined(INFINIBAND) && !defined(OPENIB) printf("a: asynchronous receive (a.k.a. preposted receive)\n"); #endif printf("B: burst all preposts before measuring performance\n"); @@ -998,7 +1020,7 @@ printf("b: specify TCP send/receive socket buffer sizes\n"); #endif -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("c: specify type of completion <-c type>\n" " valid types: local_poll, vapi_poll, event\n" " default: local_poll\n"); @@ -1010,7 +1032,7 @@ printf(" all MPI-2 implementations\n"); #endif -#if defined(TCP) || defined(INFINIBAND) +#if defined(TCP) || defined(INFINIBAND) || defined(OPENIB) printf("h: specify hostname of the receiver <-h host>\n"); #endif @@ -1019,7 +1041,7 @@ printf("i: Do an integrity check instead of measuring performance\n"); printf("l: lower bound start value e.g. <-l 1>\n"); -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("m: set MTU for Infiniband adapter <-m mtu_size>\n"); printf(" valid sizes: 256, 512, 1024, 2048, 4096 (default 1024)\n"); #endif @@ -1030,7 +1052,7 @@ printf("p: set the perturbation number <-p 1>\n" " (default = 3 Bytes, set to 0 for no perturbations)\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! 
defined(INFINIBAND) && !defined(OPENIB) printf("r: reset sockets for every trial\n"); #endif @@ -1039,7 +1061,7 @@ printf("S: Use synchronous sends.\n"); #endif -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("t: specify type of communications <-t type>\n" " valid types: send_recv, send_recv_with_imm,\n" " rdma_write, rdma_write_with_imm\n" @@ -1056,7 +1078,7 @@ #if defined(MPI) printf(" May need to use -a to choose asynchronous communications for MPI/n"); #endif -#if defined(TCP) && !defined(INFINIBAND) +#if defined(TCP) && !defined(INFINIBAND) && !defined(OPENIB) printf(" The maximum test size is limited by the TCP buffer size/n"); #endif printf("\n"); @@ -1131,7 +1153,7 @@ memset(p->s_buff, 'b', nbytes+soffset); } -#if !defined(INFINIBAND) && !defined(ARMCI) && !defined(LAPI) && !defined(GPSHMEM) && !defined(SHMEM) && !defined(GM) +#if !defined(OPENIB) && !defined(INFINIBAND) && !defined(ARMCI) && !defined(LAPI) && !defined(GPSHMEM) && !defined(SHMEM) && !defined(GM) void MyMalloc(ArgStruct *p, int bufflen, int soffset, int roffset) { --- NetPIPE_3.6.2.orig/src/netpipe.h 2004-06-22 12:38:41.000000000 -0700 +++ NetPIPE_3.6.2/src/netpipe.h 2005-03-14 16:20:30.000000000 -0800 @@ -27,6 +27,10 @@ #include /* ib_mtu_t */ #endif +#ifdef OPENIB +#include <infiniband/verbs.h> /* enum ibv_mtu */ +#endif + #ifdef FINAL #define TRIALS 7 #define RUNTM 0.25 @@ -73,9 +77,14 @@ int commtype; /* Communications type */ int comptype; /* Completion type */ #endif +#if defined(OPENIB) + enum ibv_mtu ib_mtu; /* MTU Size for Infiniband HCA */ + int commtype; /* Communications type */ + int comptype; /* Completion type */ +#endif }; -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) enum completion_types { NP_COMP_LOCALPOLL, /* Poll locally on last byte of data */ NP_COMP_VAPIPOLL, /* Poll using vapi function */ From halr at voltaire.com Tue Mar 15 14:22:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Mar 2005 17:22:38 -0500 Subject: [Fwd: Re: 
[openib-general] Solaris IPoIB MTU with OpenSM] In-Reply-To: <1110921356.7768.447.camel@sr1-umpk-01> References: <1109969601.4648.32.camel@erez-s.us.voltaire.com> <1110921356.7768.447.camel@sr1-umpk-01> Message-ID: <1110925357.4662.682.camel@localhost.localdomain> Hi Nitin, On Tue, 2005-03-15 at 16:15, Nitin Hande wrote: > This is cool, I have got Solaris IPoIB happily working with the > OpenSM now. It plumbs, pings and snoops on 0xffff pkey. Great. That's good news. I'll work on a real fix for this now. > On other hand, on my linux node, if I try to use 8001 partition and > configure IB interface with IP addr (same time while ib0 is using 0xffff > pkey), I get the following error, you may want to investigate that.... > > [root at flopteron2 ~]# echo 0x8001 > /sys/class/net/ib0/create_child > [root at flopteron2 ~]# ifconfig ib0.8001 10.10.1.1 > [root at flopteron2ib0.8001: multicast join failed for > ff12:401b:8001:0:0:0:ffff:ffff, status -22 > ~]# ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, > status -22 I will look into this but I suspect this is caused by the response to some request in the join "flow" to be more than 1 RMPP packet. Remember that OpenSM is currently hamstrung in this manner until there is sufficient RMPP for SA GetTableResps. Thanks. -- Hal From robert.j.woodruff at intel.com Tue Mar 15 14:25:50 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 15 Mar 2005 14:25:50 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> Roland> I just spent a little time creating a new "ibv" module for NetPIPE >that runs on top of the userspace verbs I've been developing on the >roland-uverbs branch. Cool. this will be very useful. Any idea if/when the netpipe folks will release a version of netpipe that has this patch included ? 
woody From roland at topspin.com Tue Mar 15 15:03:09 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 15:03:09 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> (Robert J. Woodruff's message of "Tue, 15 Mar 2005 14:25:50 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> Message-ID: <52fyywwlzm.fsf@topspin.com> Robert> Cool. this will be very useful. Any idea if/when the Robert> netpipe folks will release a version of netpipe that has Robert> this patch included ? That's up to the netpipe folks. Posting the patch was the first contact I've made beyond downloading the source yesterday. It might be reasonable to wait until the APIs are a little more frozen and the support has landed on the OpenIB trunk (as I said, userspace verbs are still only on the roland-uverbs branch). I would estimate a time frame on the order of weeks for that to happen. - R. From robert.j.woodruff at intel.com Tue Mar 15 15:11:21 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 15 Mar 2005 15:11:21 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <52fyywwlzm.fsf@topspin.com> Message-ID: >It might be reasonable to wait until the APIs are a little more frozen >and the support has landed on the OpenIB trunk (as I said, userspace >verbs are still only on the roland-uverbs branch). I would estimate a >time frame on the order of weeks for that to happen. > - R. Good point, probably a little early for them to start to integrate until things settle down and the usermode verbs move to the trunk. On another note, Arlin says he is making good progress on the uDAPL port so we should have another test vehicle for the user-mode verbs pretty soon. Any idea when the user-mode CM support will show up ? 
From halr at voltaire.com Tue Mar 15 15:16:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Mar 2005 18:16:41 -0500 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: References: Message-ID: <1110928601.4662.721.camel@localhost.localdomain> On Tue, 2005-03-15 at 18:11, Bob Woodruff wrote: > Any idea when the user-mode CM support will show up ? I think it should be there in about a couple of weeks. -- Hal From roland at topspin.com Tue Mar 15 17:01:51 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 17:01:51 -0800 Subject: [openib-general] [PATCH] alignment check in reg_phys_mr In-Reply-To: <20050314144650.GF16749@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Mar 2005 16:46:50 +0200") References: <20050314144650.GF16749@mellanox.co.il> Message-ID: <52br9kwghs.fsf@topspin.com> Thanks, applied. - R. From mst at mellanox.co.il Tue Mar 15 21:18:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Mar 2005 07:18:48 +0200 Subject: [openib-general] [PATCH] set lkey in mthca mpt entry In-Reply-To: <52hdjczwrt.fsf@topspin.com> References: <20050315162706.GG16749@mellanox.co.il> <52hdjczwrt.fsf@topspin.com> Message-ID: <20050316051848.GA3950@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] set lkey in mthca mpt entry > > Michael> lkey does not seem to be set in the mpt entry. does this > Michael> look right? > > You would know better but my docs say that the lkey field should be > set to 0 for SW2HW_MPT and is only used to refer to the original > region for memory windows. > > - R. > Correct, sorry. lkey is for query only, I confused it with memkey. -- MST - Michael S. 
Tsirkin From hozer at hozed.org Tue Mar 15 22:39:28 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 16 Mar 2005 00:39:28 -0600 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> Message-ID: <20050316063928.GV9768@kalmia.hozed.org> On Tue, Mar 15, 2005 at 02:25:50PM -0800, Woodruff, Robert J wrote: > > Roland> I just spent a little time creating a new "ibv" module for > NetPIPE > >that runs on top of the userspace verbs I've been developing on the > >roland-uverbs branch. > > Cool. this will be very useful. Any idea if/when the netpipe folks will > release a version of netpipe that has this patch included ? I'll ask Dave Turner what he wants to do about this.. Once I get it built and tested locally, I'll probably stick some results and a link up at http://scl.ameslab.gov/Projects/InfiniBand/ Sooo... what's the easiest way for me to test this if I have opterons with 2.6.11.4 kernels? (aka, just replace drivers/infiniband from the roland-uverbs branch? And does anyone have a clean way of building all the userspace stuff? What I've seen so far is pretty tedious) From mark_seuss at yahoo.com Tue Mar 15 22:39:31 2005 From: mark_seuss at yahoo.com (Mark Seuss) Date: Tue, 15 Mar 2005 22:39:31 -0800 (PST) Subject: [openib-general] How come the CM doesn't implement the state machine? Message-ID: <20050316063931.87938.qmail@web61306.mail.yahoo.com> I have a basic question about the CM. It looks like the gen2 CM doesn't implement the CM state machine as defined in the IB spec. It doesn't perform retransmissions, handle timeouts, etc. Is the current CM API intended as the final API, or is this just an intermediate step on the way to implementing a full CM such as the gen1 CM?
From paul.baxter at dsl.pipex.com Wed Mar 16 00:58:54 2005 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Wed, 16 Mar 2005 08:58:54 -0000 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> Message-ID: <001501c52a06$626c6b10$8000000a@blorp> From: "Troy Benjegerdes" > Once I get it built and tested locally, I'll probably stick some results > and a link up at http://scl.ameslab.gov/Projects/InfiniBand/ > > Sooo... what's the easiest way for me to test this if I have opterons > with 2.6.11.4 kernels? > > (aka, just replace drivers/infiniband from the roland-uverbs branch? And > does anyone have a clean way of building all the userspace stuff? What > I've seen so far is pretty tedious) Troy, While I appreciate your keenness, I think it's a little unfair to criticise the build status and organisation of code that is still being written and is subject to change. I'd far rather everyone gets a working core before worrying so much about how it might be packaged. That does need to be addressed, of course. Your comments at your URL regarding complexity and size of the software stack making progress slow are IMHO unfair to openib. They've worked hard on getting a streamlined set of functionality into the kernel and now need to finish off key parts of userspace support and only then 'package' it so that you will find it easier to compile and test. They will also need to get some reasonable documentation together (update the material at the sourceforge IB project?) and start adding other user-space/kernel functionality, but right now patience is a virtue :) PS I'm looking forward to another of your excellent writeups when you do get this working. I hope its current status doesn't colour or frustrate your view of this promising 'alpha' userspace code.
Regards Paul Baxter From halr at voltaire.com Wed Mar 16 08:16:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Mar 2005 11:16:43 -0500 Subject: [openib-general] Re: [openib-commits] r1895 - in gen1/trunk/src/userspace/osm: . complib iba opensm osmsh osmtest utils In-Reply-To: <20050224143819.15CE22284D7@openib.ca.sandia.gov> References: <20050224143819.15CE22284D7@openib.ca.sandia.gov> Message-ID: <1110989803.4662.2038.camel@localhost.localdomain> On Thu, 2005-02-24 at 09:38, eitan at openib.org wrote: > Author: eitan > Date: 2005-02-24 06:38:16 -0800 (Thu, 24 Feb 2005) > New Revision: 1895 > Log: > OpenSM Rev 1.8.0 Gen1 release Do you mean 1.7.0 rather than 1.8.0 release ? Thanks. -- Hal From roland at topspin.com Wed Mar 16 08:52:56 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 16 Mar 2005 08:52:56 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <20050316063928.GV9768@kalmia.hozed.org> (Troy Benjegerdes's message of "Wed, 16 Mar 2005 00:39:28 -0600") References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> Message-ID: <523buvwn13.fsf@topspin.com> Troy> Sooo... what's the easiest way for me to test this if I have Troy> opterons with 2.6.11.4 kernels? Troy> (aka, just replace drivers/infiniband from the roland-uverbs Troy> branch? And does anyone have a clean way of building all the Troy> userspace stuff? What I've seen so far is pretty tedious) Yes, the roland-uverbs src/linux-kernel/infiniband directory should just drop in and replace the existing drivers/infiniband. You'll want to turn on CONFIG_INFINIBAND_USER_VERBS in your config (a new option) to enable userspace verbs, load the ib_uverbs module (if you don't build support into your kernel), and create /dev/infiniband/uverbs device nodes (easiest way is to add KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666" to your udev rules). 
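
As a quick sanity check after the steps above, something like the following C sketch can confirm that a uverbs device node exists and is usable before trying any verbs program. It is only an illustration: check_uverbs_node is a made-up helper (not part of libibverbs), and the node name /dev/infiniband/uverbs0 mentioned below is an assumption about the first node the udev rule above would create.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not part of libibverbs.
 * Returns 0 if `path` is a character device that the caller may open
 * for reading and writing, -1 otherwise. */
static int check_uverbs_node(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;      /* node missing: module not loaded or no udev rule? */
    if (!S_ISCHR(st.st_mode))
        return -1;      /* exists, but is not a character device */
    if (access(path, R_OK | W_OK) != 0)
        return -1;      /* permissions too strict (the rule uses MODE="0666") */
    return 0;
}
```

Calling check_uverbs_node("/dev/infiniband/uverbs0") from a small main() and printing the result is enough to tell a missing node apart from a working setup; it does not tell you whether the HCA itself initialized.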
To build the userspace verbs support, you just need to build libibverbs and libmthca libraries (using the usual "./autogen.sh && ./configure && make && make install" recipe). I agree that the management subdirectory has a few too many little pieces right now, but it's not needed if you already have a subnet manager running somewhere. - R. From mst at mellanox.co.il Wed Mar 16 08:58:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Mar 2005 18:58:41 +0200 Subject: [openib-general] userspace doorbells In-Reply-To: <521xanbbi7.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> Message-ID: <20050316165841.GP16749@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > Michael> I think I have discovered the problem. It seems that with > Michael> -O3 my compiler may reorder the WQE (and possibly CQE) > Michael> write with respect to the doorbell. This won't happen on > Michael> i386 with consistent i/o ordering since the doorbell is > Michael> done in assembly, and probably not on other 32 bit > Michael> architectures since the mutex is likely to include a > Michael> memory barrier. > > Michael> Applying the following patch fixes the problem for me for > Michael> x86_64. > > Thanks for diagnosing this. I think I want to work on a more general > fix though. > > - R. > Roland, I see you have made the doorbell page volatile. This makes sense, and must be enough on x86_64, but for this to work on PPC, won't you still need to insert a write memory barrier, to guard against the CPU re-ordering writes to hardware and to the WQE? Since you do it in kernel, why not in userspace? -- MST - Michael S.
Tsirkin From roland at topspin.com Wed Mar 16 09:01:06 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 16 Mar 2005 09:01:06 -0800 Subject: [openib-general] Re: userspace doorbells In-Reply-To: <20050316165841.GP16749@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 16 Mar 2005 18:58:41 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050316165841.GP16749@mellanox.co.il> Message-ID: <52oedjv831.fsf@topspin.com> Michael> Roland, I see you have made the doorbell page volatile. Michael> This makes sense, and must be enough on x86_64, but for Michael> this to work on PPC, won't you still need to insert a Michael> write memory barrier, to guard against the CPU Michael> re-ordering writes to hardware and to the WQE? Since you Michael> do it in kernel, why not in userspace? I'm working on it... see the file I added to libibverbs for the start of my plan. - R. From halr at voltaire.com Wed Mar 16 08:59:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Mar 2005 11:59:24 -0500 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <523buvwn13.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> <523buvwn13.fsf@topspin.com> Message-ID: <1110992364.4662.2070.camel@localhost.localdomain> On Wed, 2005-03-16 at 11:52, Roland Dreier wrote: > I agree that the > management subdirectory has a few too many little pieces right now, > but it's not needed if you already have a subnet manager running > somewhere. And you don't need all the pieces if all you want to do is run OpenSM and don't care about the diagnostics.
-- Hal From mshefty at ichips.intel.com Wed Mar 16 09:25:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Mar 2005 09:25:49 -0800 Subject: [openib-general] How come the CM doesn't implement the state machine? In-Reply-To: <20050316063931.87938.qmail@web61306.mail.yahoo.com> References: <20050316063931.87938.qmail@web61306.mail.yahoo.com> Message-ID: <42386C1D.4000206@ichips.intel.com> Mark Seuss wrote: > I have a basic question about the CM. It looks like the gen2 > CM doesn't implement the CM state machine as defined in the IB spec. The gen2 CM implements the state machine as defined by the IB spec. The states are defined in ib_cm.h, and the CM uses these when processing sent or received MADs and to handle timewait. > It doesn't perform retransmissions, handle timeouts, etc. Is the The CM will retry requests and perform timeouts. See cm_process_send_timeout(). > current CM API intended as the final API, or is this just an > intermediate step on the way to implementing a full CM such as the > gen1 CM? The gen2 CM is a full CM. It does have some missing functionality, but nothing that should prevent it from operating. - Sean From mkowalski01 at gmail.com Wed Mar 16 09:30:43 2005 From: mkowalski01 at gmail.com (mark kowalski) Date: Wed, 16 Mar 2005 11:30:43 -0600 Subject: [openib-general] failure using second hca via udapl Message-ID: I've been writing some code using udapl and recently added a second hca to my machine. 
Both hca's are mellanox cards: 0000:02:03.0 PCI bridge: Mellanox Technology MT23108 InfiniHost HCA bridge (rev a1) 0000:03:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost HCA (rev a1) 0000:04:04.0 PCI bridge: Mellanox Technology MT23108 InfiniHost HCA bridge (rev a0) 0000:05:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost HCA (rev a0) Both cards are recognized and seem to initialize fine: Mar 15 17:40:37 kernel: Mellanox Tavor Device Driver is creating device "InfiniHost0" (bus=03, devfn=00) Mar 15 17:40:37 kernel: Mellanox Tavor Device Driver is creating device "InfiniHost1" (bus=05, devfn=00) The problem is that when I try to access ports on the second hca I get this failure: EVAPI_k_get_qp_hndl returns -244 (Invalid HCA Handle.) tsIbUCmAccept failed: -5 I noticed this comment in the tsIbUCmAccept routine: /* FIXME: Don't hardcode HCA handle for EVAPI_k_get_qp_hndl and _tsIbUQpRegister */ after it set the qp_handle variable to 0. I modified the code to pass the hca_handle that is input to the tsIbUCmAccept function (VAPI_hca_hndl_t hca_handle) on the call to EVAPI_k_get_qp_hndl instead of the qp_handle variable. (I put some print statements in the hca initialization code and the hca_handle for the first hca was 0 and the hca_handle for the second hca was 1, so this seemed like a reasonable thing to do since the hca_handle is an index into the hca_tbl). Anyway, I get past the EVAPI_k_get_qp_hndl failure but instead I get this failure: kernel: [KERNEL_IB][tsIbCmUserAccept][/var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/core/useraccess_cm.c:877]EVAPI_k_set_destroy_qp_cbk, return code = -251 (Resource is busy) in routine tsIbCmUserAccept in the device driver code during the call to this routine: EVAPI_k_set_destroy_qp_cbk.
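
To make the hard-coded-handle problem concrete, here is a small C sketch of the general failure mode. It is purely illustrative and does not reproduce the VAPI/EVAPI code: hca_entry, the contents of hca_tbl, the QP numbers, the return codes, and lookup_qp are all made up for the example. The only thing taken from the description above is that the HCA handle acts as an index into a per-HCA table.

```c
#include <stddef.h>

#define MAX_HCAS 2
#define MAX_QPS  4

/* Hypothetical per-HCA state; not the real driver structures. */
struct hca_entry {
    int          in_use;
    unsigned int qp_nums[MAX_QPS];
    size_t       n_qps;
};

static struct hca_entry hca_tbl[MAX_HCAS] = {
    { 1, { 0x100, 0x101 }, 2 },   /* first HCA,  handle 0 */
    { 1, { 0x200 },        1 },   /* second HCA, handle 1 */
};

/* Look up qp_num on the HCA selected by hca_hndl.
 * Returns 0 on success, -1 for an invalid handle,
 * -2 if the QP does not belong to that HCA. */
static int lookup_qp(int hca_hndl, unsigned int qp_num)
{
    size_t i;

    if (hca_hndl < 0 || hca_hndl >= MAX_HCAS || !hca_tbl[hca_hndl].in_use)
        return -1;
    for (i = 0; i < hca_tbl[hca_hndl].n_qps; i++)
        if (hca_tbl[hca_hndl].qp_nums[i] == qp_num)
            return 0;
    return -2;
}
```

With this table, lookup_qp(0, 0x200) fails because QP 0x200 lives on the second HCA, while lookup_qp(1, 0x200) succeeds. A caller that always passes handle 0 therefore works only for the first card, which matches the symptom reported above; it says nothing about the later "Resource is busy" error.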
I've gone through the initialization code and it seems that everything that is done for the first hca is done for the second, so it would seem that once I passed in the correct hca_tbl index everything should work, but it doesn't.

Anyone have 2 HCAs working via udapl out there?

Thanks,
Mark Kowalski

From mshefty at ichips.intel.com Wed Mar 16 09:34:22 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 16 Mar 2005 09:34:22 -0800
Subject: [openib-general] [PATCH] [MAD] API changes and updates to support RMPP
In-Reply-To: <20050314162258.1bedff07.mshefty@ichips.intel.com>
References: <20050314162258.1bedff07.mshefty@ichips.intel.com>
Message-ID: <42386E1E.1070701@ichips.intel.com>

Sean Hefty wrote:
> This patch updates the MAD API to help provide support for the RMPP
> implementation and clients. Notable changes:
>
> * A valid memory region (MR) is returned as part of the mad_agent
> registration process. The agent, CM, and SA query modules were
> updated to use the returned MR.
> * A list_head structure was added to ib_mad_recv_wc to make walking
> the list of received MAD buffers easier. As part of this change, a
> bug was fixed where freed memory could have been accessed in
> ib_free_recv_mad() if RMPP were enabled. This change is unlikely
> to affect existing clients.

If no one objects, I will commit these changes later today, so I can push in the RMPP changes.

- Sean

From eitan at mellanox.co.il Wed Mar 16 09:47:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 16 Mar 2005 19:47:39 +0200
Subject: [openib-general] RE: [openib-commits] r1895 - in gen1/trunk/src/userspace/osm: . complib iba opensm osmsh osmtest utils
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEF9F@mtlex01.yok.mtl.com>

The OpenSM of the IBGD 1.7.0 is named 1.8.0 due to the many bug fixes it has.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, March 16, 2005 6:17 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: Re: [openib-commits] r1895 - in gen1/trunk/src/userspace/osm: . complib iba
> opensm osmsh osmtest utils
>
> On Thu, 2005-02-24 at 09:38, eitan at openib.org wrote:
> > Author: eitan
> > Date: 2005-02-24 06:38:16 -0800 (Thu, 24 Feb 2005)
> > New Revision: 1895
>
> > Log:
> > OpenSM Rev 1.8.0 Gen1 release
>
> Do you mean 1.7.0 rather than 1.8.0 release?
>
> Thanks.
>
> -- Hal

From mkowalski01 at gmail.com Wed Mar 16 15:15:43 2005
From: mkowalski01 at gmail.com (mark kowalski)
Date: Wed, 16 Mar 2005 17:15:43 -0600
Subject: [openib-general] how to turn on kernel level tracing (mtl_log)
Message-ID:

I've been trying, unfortunately without success, to turn on tracing within the kernel components of openib. I've used the logset mtl_log_dbg_print command to toggle the "debug_print" variable used in the mtl_log command. I've also used logset to set the list of severities to be printed to 8 (12345678). I've also turned on printing for any debug or error messages (MTL_DEBUG and MTL_ERROR). I even went so far as to add print info structure records for every module_name I could grep in the source code (VIP, HCA, VIPKL, etc).

None of this had any effect on getting any kind of trace records printed out of the kernel. The only messages I got were sev 1 messages when I shut the system down.
<4>***** mtl_log:DEBUG: layer 'VIPKL', type 2, sev '1'
<4>***** mtl_log:DEBUG: Found layer 'VIPKL', Name="error", sev="12345678"
<4>***** mtl_log:DEBUG: print string
<1> VIPKL(1): var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/hw/mellanox-hca/vipkl/em.c[87]: EM delete:found unreleased async object
<4>***** mtl_log:DEBUG: layer 'VIPKL', type 2, sev '1'
<4>***** mtl_log:DEBUG: Found layer 'VIPKL', Name="error", sev="12345678"
<4>***** mtl_log:DEBUG: print string
<1> VIPKL(1): var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/hw/mellanox-hca/vipkl/em.c[87]: EM delete:found unreleased async object

Is the code originally compiled with the MAX_TRACE variable set to 1? Is there a way this could be changed or bypassed without recompiling the source?

I also added the MTL_LOG environment variable to get non-kernel messages, but that didn't produce any messages either.

Any help would be appreciated.

Thanks
Mark

From sean.hefty at intel.com Wed Mar 16 15:39:13 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 16 Mar 2005 15:39:13 -0800
Subject: [openib-general] [PATCH] [MAD] API changes and updates to support RMPP
In-Reply-To: <42386E1E.1070701@ichips.intel.com>
Message-ID: